Abstract

The proliferation of simple and low-cost devices, such as IEEE 802.15.4
“ZigBee” and Z-Wave, in Critical Infrastructure (CI) increases security concerns. Radio
Frequency “Distinct Native Attribute” (RF-DNA) Fingerprinting facilitates biometric-like
identification of electronic device emissions based on variances in device hardware.
Developing reliable classifier models using RF-DNA fingerprints is thus important for
device discrimination to enable reliable Device Classification (a one-to-many looks
“most like” assessment) and Device ID Verification (a one-to-one looks “how much like”
assessment). AFIT’s prior RF-DNA work focused on Multiple Discriminant
Analysis/Maximum Likelihood (MDA/ML) and Generalized Relevance Learning Vector
Quantized Improved (GRLVQI) classifiers. This work 1) introduces a new GRLVQI-Distance
(GRLVQI-D) classifier that extends prior GRLVQI work by supporting
alternative distance measures, 2) formalizes a framework for selecting competing
distance measures for GRLVQI-D, 3) introduces response surface methods for
optimizing GRLVQI and GRLVQI-D algorithm settings, 4) develops an MDA-based
Loadings Fusion (MLF) Dimensional Reduction Analysis (DRA) method for improved
classifier-based feature selection, 5) introduces the F-test as a DRA method for RF-DNA
fingerprints, 6) provides a phenomenological understanding of test statistics and p-values,
with KS-test and F-test statistic values being superior to p-values for DRA, and 7)
introduces quantitative dimensionality assessment methods for DRA subset selection.




The optimized GRLVQI algorithm and the proposed GRLVQI-D algorithm show
improved performance over the baseline GRLVQI algorithm. When considering the
optimized GRLVQI and GRLVQI-D classifiers using N_F = 189 Z-Wave features and an
arbitrary average correct classification (%C) benchmark of %C = 90%, demonstrated
Device Classification SNR gain (G_SNR) performance relative to baseline GRLVQI
includes 1) improved G_SNR = +1.84 dB using GRLVQI-D with a Cosine distance
measure, and 2) best case G_SNR = +1.94 dB using the optimized GRLVQI algorithm. For
Z-Wave Device ID Verification, results for correct verification of authorized
device IDs (%V_A) include 1) worst case %V_A = 33.33% for baseline GRLVQI,
2) improved %V_A = 66.66% for GRLVQI-D using a Cosine distance measure, and 3) best
case %V_A = 100% using the optimized GRLVQI algorithm.

The proposed F-test and MLF DRA methods are shown to offer distinct
performance improvements. ZigBee Device Classification results for selected DRA
methods, with an MDA/ML classifier benchmark of %C = 90%, included SNR gain
relative to the benchmark GRLVQI DRA with N_DRA = 50 feature sets of
1) G_SNR = +0.82 dB for MLF DRA, and 2) G_SNR = +0.10 dB for F-test DRA using
N_DRA = 50. ZigBee Device ID Verification results, using the same N_DRA = 50 feature sets
and MDA/ML classifier, included correct %V_A and correct detection of unauthorized
rogue device IDs (%V_R) of %V_A = 50% and %V_R = 91.67% for the benchmark GRLVQI
DRA, with 1) comparable %V_A = 50% and %V_R = 91.67% for MLF DRA, and 2) best
case %V_A = 75% and %V_R = 91.67% for F-test DRA.
I. Introduction


But in war, as in life generally, all parts of the whole are interconnected and thus the
effects produced, however small their cause, must influence all subsequent military
operations ...

-Carl von Clausewitz, 1780 - 1831

Communication networks permeate society, from commercial networks, such
as the internet, cell phones and Wi-Fi, to Industrial Control Systems (ICS), such as
Supervisory Control And Data Acquisition (SCADA) systems, which monitor and
control many critical infrastructure (CI) systems. In all communication networks, one is
interested in a balance between attributes such as performance, security, reliability,
availability, and survivability [1, 2]. In CI applications, all of these attributes are
necessary since CI interruption can threaten lives, disable governments, affect the
economy, and damage ecological systems [3]. Additionally, the “fog of war” has been
reduced due to advances in digital communications [4]; however, security concerns can
both limit user confidence in communications networks [5] and reduce this functionality
[4].

Security  is  a  critical  component  in  communication  networks  and,  due  to 
functional  interconnectedness,  compromising  one  point  can  compromise  overall  system 
security  [6].  Therefore,  the  security  of  communication  and  industrial  networks  and 
devices  is  of  high  importance  to  the  Department  of  Defense  [7-9].  Various  issues  exist  in 
securing  hardware  [10],  including:  1)  identifying  counterfeited  or  reused  components 
[11-17], 2) determining claimed device identities [18, 19], and 3) determining aging
effects [20-34].

Improving methods for vetting communication device identity by examining and
characterizing device physical properties is of interest. AFIT’s Radio Frequency (RF)
Fingerprinting process, RF Distinct Native Attribute (RF-DNA) Fingerprinting [19], is a
systematic and proven method for extracting statistical features from waveform data. Of
interest in this research was the extension and improvement of RF-DNA practices for
improved communication device identification and security.

1.1 Operational Motivation

The “Internet of Things” is predicted to enable wide connectivity between
commercial, industrial and consumer devices [35]. However, such connectivity includes
many risks due to the possibility of hackers disrupting services, stealing information, or
taking control of various devices in CI applications or consumer use [35, 36]. Facilitating
the “Internet of Things” is the proliferation of low-cost networks, such as those created by
IEEE 802.15.4 “ZigBee” and Z-Wave devices, into CI applications, which presents numerous
security issues [37, 38].

Both ZigBee and Z-Wave devices have numerous operating advantages that
motivate their use in CI applications, such as the ability to communicate up to 100 meters
and the ability to sustain networks comprised of up to 65,000 devices [39]. Given these
advantages, ZigBee devices are believed to provide interconnections between more
physical devices in the world than any other wireless technology [37]. CI networks
frequently include many low-cost communication devices, such as ZigBee and Z-Wave,
for interaction with physical objects, e.g. power relays [40-43], patient monitoring
devices [44], security systems [45], automation and control systems [46], home
automation [47], and electric metering [48].

Due to the ubiquity of ZigBee and Z-Wave devices, general security concerns
exist because a single fraudulent or hacked network device can compromise overall
network security [49], and the amount of interconnectivity with ZigBee and Z-Wave raises
concerns given their inherent security risks [37]. Thus, vetting communication device
identity is critical to overall security. In regular operation, a typical communication
network experiences many devices requesting network access. Passwords and keys
required to gain access can be shared or forged; however, the physical properties of a
given device are inherently harder to forge.

Reliable  network  security  involves  considering  multiple  layers  of  access  and 
interfacing between components and users. Devices, their operations, and applications
for networks can be characterized by the seven-layer Open Systems Interconnection (OSI)
model (Table 1-1). As one progresses from the Physical (PHY) layer to the Application
layer,  an  increasing  number  of  trust  assumptions  are  made  [50].  Historically,  security  has 
not  adequately  considered  the  physical  attributes  of  devices  themselves.  Rather,  much 
emphasis  and  research  on  network  security  and  unauthorized  access  detection  occurs  at 
the  Application,  Network  and  Data  Link  layers  [51-60],  and  Application  Layer  [61]. 
Table 1-1: OSI Model, adapted from [62-65].

Group         Data Unit  Layer         Description                                                    Example
------------  ---------  ------------  -------------------------------------------------------------  -------------------------------------------------------
Host Layers   Data       Application   Process to access network                                      End User
Host Layers   Data       Presentation  Formats data for application layer, and encrypts data          Syntax, data manipulation
Host Layers   Data       Session       Interhost connections, session establishment                   Synching
Host Layers   Segments   Transport     End-to-end connections                                         TCP, host-to-host
Media Layers  Packets    Network       Controls subnet, decides physical path for data, IP            Packets, routing
Media Layers  Frames     Data Link     Transfer of data between nodes over physical devices           Frames, MAC addresses
Media Layers  -          Physical      Transmission and reception of media, signal; physical devices  Cables, devices, physical mediums, transmission methods

PHY  features  are  considered  as  an  additional  level  of  security  for  more  robust 
security  systems  and  rogue  device  authentication  [19].  For  improved  security  and 
monitoring  of  device  operations,  it  is  desirable  to  collect  and  monitor  identifiable  features 
possessing  qualities  of  universality,  distinctiveness,  permanence,  and  collectability  [18, 
66].  Moreover,  these  feature  qualities  are  akin  to  biometric  features  [67-70].  AFIT’s 
RF-DNA  Fingerprinting  is  one  proven  method  for  exploiting  biometric-like  features  of 
electronic  devices  and  was  therefore  of  interest  for  this  research. 


1.2  Radio  Frequency  Fingerprinting 

Broadly, there are two PHY-layer based security approaches that have been
applied: 1) the addition of physically traceable objects to devices [71-73], and 2) the
exploitation of inherent and unique features in device signals through RF
Fingerprinting [18, 74-76]. A variety of research has been conducted in the area of RF
Fingerprinting, cf. [49, 51, 66, 75-118], but each generally follows a similar procedure
whereby fingerprint features are extracted from device emissions. In general, RF
Fingerprinting processes involve 1) selecting Regions of Interest (ROIs) within a given
signal response, 2) computing features from each ROI, 3) computing fingerprints from
each feature, and 4) training classifier models to discriminate on these features [102]. RF
Fingerprinting research has considered various wireless communication devices,
including IEEE 802.11 (Wi-Fi) [92, 96, 97, 106, 119, 120], IEEE 802.16 (WiMAX) [98],
IEEE 802.15.4 (ZigBee) [49, 89, 91, 113, 121, 122], Z-Wave [49, 123], Satellite
Communication (SatCom) [124], Global System for Mobile Communications (GSM)
cellular phones [101, 125], IEEE 802.15 Bluetooth [86], Ethernet [77, 126, 127], and
Radio Frequency Identification (RFID) [78, 109].

Of specific interest in this research was the RF-DNA Fingerprinting method as
codified by Cobb et al. [18, 19] and extended by work in [74]. As adopted here, the
RF-DNA Fingerprinting process considered statistical features computed in each ROI of the
instantaneous amplitude, frequency and phase responses [18]. RF-DNA has been
employed in many applications [18, 19, 49, 74, 89-93, 97-99, 101, 113, 121, 128] and
shown efficacy for both cross-model (different manufacturers) [101] and like-model
(same manufacturer, same model, different serial number) device discrimination [92].
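
The feature computation described above can be illustrated with a minimal sketch. It is not the implementation used in the cited work: the choice of statistics (variance, skewness, kurtosis), the equal-length sub-region split, the mean-centering step, and the function names are all assumptions made for illustration. Under these assumptions, 80 sub-regions plus the full ROI yield 3 x 81 x 3 = 729 features and 20 sub-regions yield 189, which matches the feature counts quoted in this document, although the sub-region counts themselves are not stated in this chapter.

```python
import numpy as np


def instantaneous_responses(iq):
    """Return instantaneous amplitude, phase, and frequency of complex samples."""
    amp = np.abs(iq)
    phase = np.unwrap(np.angle(iq))
    freq = np.gradient(phase) / (2 * np.pi)  # cycles per sample
    return amp, phase, freq


def region_stats(x):
    """Variance, skewness, and kurtosis of one region (assumed statistic set)."""
    sigma = x.std()
    if sigma == 0:
        return np.zeros(3)
    z = (x - x.mean()) / sigma
    return np.array([sigma ** 2, np.mean(z ** 3), np.mean(z ** 4)])


def rf_dna_fingerprint(iq, n_regions):
    """Concatenate statistics over n_regions sub-regions plus the whole ROI."""
    features = []
    for response in instantaneous_responses(iq):
        centered = response - response.mean()  # centering is an assumption
        regions = np.array_split(centered, n_regions) + [centered]
        for region in regions:
            features.append(region_stats(region))
    return np.concatenate(features)


# Hypothetical ROI of 800 complex samples standing in for a collected preamble.
rng = np.random.default_rng(0)
roi = rng.standard_normal(800) + 1j * rng.standard_normal(800)
fingerprint = rf_dna_fingerprint(roi, n_regions=80)
print(fingerprint.shape)
```

Each fingerprint produced this way is one row of the feature matrix that a classifier such as MDA/ML or GRLVQI would be trained on.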

RF-DNA Fingerprinting embodies Wittgenstein’s [129] proposition that “in order
to know an object, I must know not its external but all its internal qualities,” by
augmenting the current external security measures via characterizing the internal
qualities. However, it should be stated that any measurements are model-based
observations of the real phenomena [130], or as Heisenberg stated [131], "We have to
remember that what we observe is not nature herself, but nature exposed to our method of
questioning." Thus, RF-DNA Fingerprinting provides a reflection of the operating
condition of RF devices, which has been further explored by directly analyzing integrated
circuits (ICs) in [104].

1.3  Technical  Motivation 

RF  Fingerprinting  research  has  primarily  focused  on  applications  [49,  74,  78,  86, 
89,  91,  92,  96-98,  106,  109,  113,  119,  121,  125]  with  classifier  model  development  [19, 
51,  91,  92,  132]  and  Dimensional  Reduction  Analysis  (DRA)  [49,  89,  113,  132]  as 
secondary  objectives.  AFIT’s  RF-DNA  work  has  previously  considered  four 
classification  methods:  Multiple  Discriminant  Analysis/Maximum  Likelihood 
(MDA/ML) [90], Generalized Relevance Learning Vector Quantized-Improved
(GRLVQI) [51], Learning from Signals (LFS) [133], and Decision Trees/Random Forests
[134]. Additionally, since RF-DNA generally considers many fingerprint features, e.g.
N_F = 729 features for the ZigBee dataset of [91], DRA has been of interest to select
relevant  subsets  of  features. 

Various unresolved issues exist in RF Fingerprinting research, and herein
extensions are made to the RF-DNA process itself, classifier development, and DRA
methods. Three previously unresolved issues related to DRA for RF Fingerprinting are
addressed in Chapter IV: 1) understanding the appropriate use of p-values and test
statistic values when using distribution-based DRA methods [49]; 2) developing MDA
classifier-based DRA methods [135], which were previously dismissed [51, 89, 91, 92,
113, 134]; and 3) the development of quantitative dimensionality assessment methods to
determine the number of features to consider [49, 135]. Recent RF-DNA efforts have
considered a GRLVQI classifier, e.g. [51, 92, 100]; Chapter V addresses three general
issues in GRLVQI: 1) extending the algorithm to consider non-Euclidean distance
measures; 2) determining optimal algorithm parameter settings; and 3) creating a
generalizable derivative skeleton to support algorithm improvements. Although the
RF-DNA process is mature and proven, slight improvements to its operation are proposed in
Chapter VI by leveraging techniques in Simulation research [136]; specifically, an
autocorrelation-based automation approach for selecting the number of ROI sub-regions
is introduced.

1.4  Research  Contribution 

Table  1-2  provides  a  summary  and  mapping  of  the  contributions  in  this  research, 
“Current  Research,”  to  previous  related  research,  “Prior  Work.”  In  Table  1-2,  the  x 
symbol  indicates  that  a  technical  area  was  addressed. 
Table 1-2: Relational mapping between technical contributions in previous related
work and current research contributions. The x symbol denotes areas addressed.

                                        Prior Work                                      Current Research
Technical Area                          Addressed  Ref #                                Addressed  Ref #
--------------------------------------  ---------  -----------------------------------  ---------  --------------
ZigBee                                  X          [89, 91, 113, 121, 122]              X          [49, 135, 137]
Z-Wave                                  -          -                                    X          [49, 137]

Classification/Verification Processes
MDA/ML                                  X          [18, 19, 89-91, 97, 101, 105, 113]   X          [49, 135]
GRLVQI                                  X          [51, 92, 97, 100, 128]               X          [49, 137]
LFS                                     X          [88, 92, 93, 94, 119, 133]           -          -
Random Forests                          X          [126]                                -          -

Dimensionality Reduction Analysis (DRA)
MDA/ML                                  X          [18, 19, 51, 89-92, 113, 121]        X          [49, 135]
GRLVQI                                  X          [51, 92, 100]                        X          [49, 135, 137]
LFS                                     X          [88, 92, 133]                        -          -
Random Forests                          X          [132]                                -          -
KS-Test                                 X          [89, 91, 113, 121]                   X          [49, 135]
F-Test                                  -          -                                    X          [49, 135]
Qualitative Dimensionality Assessment   X          [89, 91, 113, 121, 132]              X          [49, 135]
Quantitative Dimensionality Assessment  -          -                                    X          [49, 135]

1.5  Document  Organization 

This  dissertation  is  subsequently  organized  as  follows:  Chapter  II  presents 
background  literature  on  PHY  layer  device  identification,  RF  signals,  RF-DNA,  the 
ZigBee  devices  under  analysis,  data  collection,  and  pattern  recognition.  Chapter  III 
presents  the  baseline  classifier  methods  used  in  this  study:  MDA  and  GRLVQI.  Chapter 
IV reviews and develops DRA methods for application to RF-DNA. Chapter V presents
improvements and modifications to GRLVQI, including a derivative framework to
incorporate non-Euclidean distance measures and an optimization method to determine
algorithm parameter settings. Chapter VI presents concepts from simulation studies
research and considers extensions to the RF-DNA process. Chapter VII concludes the
dissertation. Appendices A through M, which provide additional results supporting
concepts and conclusions in this dissertation, are provided following Chapter VII.
II. Background


Research has been proceeding to develop a line of ... products that establishes new
standards for quality, technological leadership, and operating excellence.

-Michael  Kraft 

This  chapter  provides  the  foundation  for  understanding  physical  (PHY)  layer 
security  of  communication  devices,  Radio  Frequency  Distinct  Native  Attribute  (RF- 
DNA)  Fingerprinting,  ZigBee  and  Z-Wave  signals  under  analysis,  and  particulars  of 
signal  collection  and  RF-DNA  feature  extraction. 

2.1  Introduction 

This  chapter  is  organized  as  follows.  First,  a  general  discussion  on  wireless 
networks  and  a  specific  discussion  on  ZigBee  and  Z-Wave  devices  are  presented  in 
Section  2.2.  Then  a  discussion  on  PHY  security  and  device  identification  is  presented  in 
Section  2.3.  Finally,  the  RF-DNA  Fingerprinting  process  is  presented  and  discussed  in 
Section  2.4. 

2.2  Signals  of  Interest:  Wireless  Networks 

Figure II-1 presents a conceptualization of basic digital communication occurring
between two devices [64, 138]. In operation, a software application initiates the
communication of a data packet. As the packet proceeds through each layer of the Open
Systems Interconnection (OSI) model, more information, in the form of headers, addresses,
etc., is added at each layer regarding the device properties, bit-level identity,
communication properties, data handling information, etc. [138]. After passing
through the OSI layers, the digitally formatted signal is transmitted over some medium
(wired or wireless) and received by another device. The receiving device collects the
signal and reverses the digital formatting process, including the removal of headers at
each layer to determine how to handle the received data [138].
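
The header handling described above can be sketched as a toy example. The bracketed text headers and the function names are hypothetical placeholders; real protocol headers carry addresses, lengths, and checksums rather than layer names. The sketch only shows the ordering: each layer wraps the data on transmit, and the receiver unwraps in the reverse order.

```python
# OSI layers in the order a packet passes through them on transmit.
LAYERS = [
    "Application", "Presentation", "Session", "Transport",
    "Network", "Data Link", "Physical",
]


def encapsulate(payload: bytes) -> bytes:
    """Prepend one placeholder header per layer, Application first."""
    frame = payload
    for layer in LAYERS:
        frame = f"[{layer}]".encode() + frame
    return frame


def decapsulate(frame: bytes) -> bytes:
    """Strip headers in reverse order, Physical first, and return the payload."""
    for layer in reversed(LAYERS):
        header = f"[{layer}]".encode()
        if not frame.startswith(header):
            raise ValueError(f"missing {layer} header")
        frame = frame[len(header):]
    return frame


sent = encapsulate(b"sensor reading")
print(sent)
print(decapsulate(sent))
```

The outermost header belongs to the Physical layer, which is the first thing a receiver (or a PHY-layer fingerprinting process) encounters.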

[Figure: Transmit and Receive stacks (Data, Application) joined by a Communication Network; addresses, headers and other data added at each layer on transmit; relevant data removed at each layer on receive.]

Figure II-1: General operations of digital communication, adapted from [64, 138].

Various technical standards exist that govern the operation of a wide variety of
communication networks. Of interest herein are ZigBee wireless personal area networks
(WPANs), which are governed by the WPAN working group (IEEE 802.15), one of 25
IEEE 802 standard subgroups for area networks [139]. The IEEE 802.15 working group
also includes Bluetooth (IEEE 802.15.1), coexistence (IEEE 802.15.2), high rate WPANs
(IEEE 802.15.3), low rate WPANs (IEEE 802.15.4), mesh networking (IEEE
802.15.5), body area networks (IEEE 802.15.6), and visible light communication (IEEE
802.15.7) [139, 140]. Due to their operating characteristics, ZigBee devices fall under
the IEEE 802.15.4 subgroup.

2.2.1  IEEE  802.15.4  ZigBee  Devices 

ZigBee  devices  are  low-cost,  low-data  rate,  low-complexity  wireless 
communication  devices  which  can  function  at  nominal  ranges  of  10-100  meters  and 
support  networks  containing  up  to  65,000  devices  [38,  39,  141].  Given  these  attributes, 
ZigBee devices are employed for various tasks and are consequently connected to more
devices in the physical world than any other wireless technology [37, 38]. Various
ZigBee  applications  include  maritime  environments  [142],  smart  thermostats  [37], 
electronic  door  locks  (e.g.  Kwikset  SmartCode)  [37]  and  security  devices  [143], 
smartphone  controlled  doorbells  [144,  145],  building  automation  and  control  [37,  46, 
146],  greenhouse  monitoring  [147,  148],  healthcare  [149,  150],  energy  management 
[151-153],  HVAC  (heating,  ventilation,  and  air  conditioning)  operations  [143],  smart 
metering  [154-156],  electricity  theft  detection  [48,  157],  smart  homes  and  smart 
appliances  [158,  159],  waste-water  management  [160],  chemical  plant  automation  [161], 
electric substation automation [162], and meter reading [163]. Many of these
applications are in areas considered ‘critical infrastructure (CI),’ the interruption of which
can threaten lives, disable governments, affect economies, and damage ecological
systems [3]. Due to the functional interconnectedness of such complex systems, a
compromise at one point can compromise the overall system security [6].
ZigBee network security frequently incorporates 128-bit Advanced Encryption
Standard (AES) encryption, a 16-bit cyclic redundancy check (CRC) for data protection, and cipher
block chaining message authentication code (CBC-MAC) for authentication [38].
However, despite their near ubiquity and security precautions, ZigBee networks are
vulnerable to intrusion through readily available ‘hacking tools’ such as KillerBee [37] or
Packet-in-Packet approaches [164]. Unfortunately, current ZigBee security mechanisms
frequently neglect the PHY layer where much malicious activity occurs [51]. PHY layer
protection involves device identification and authentication; various reasons exist for
examining this layer, including access control, augmenting other security measures,
authentication, intrusion detection, malfunction detection, and rogue access, among other
applications [66, 165, 166].

When  considering  ZigBee  devices  as  an  RF-DNA  problem,  knowledge  of  the 
underlying  standard,  IEEE  802.15.4  [121,  167],  is  important  in  order  to  determine  how 
and  with  what  signal  to  create  RF-DNA  fingerprints.  IEEE  802.15.4  has  defined  PHY, 
Media  Access  Control  (MAC),  and  Network  (NWK)  layer  specifications.  In  the 
operation of transmitting a burst signal, a ZigBee device transmission at the PHY layer
involves a structure termed a PHY Protocol Data Unit (PPDU); the PPDU contains a
defined Synchronization Header Response (SHR) and an 8-bit PHY Header Response (PHR),
in addition to a variable length ‘payload’ contained in the PHY Service Data Unit
(PSDU), which consists of a MAC sublayer frame [91]. The underlying ZigBee PHY
layer packet structure is conceptualized in Figure II-2.
Figure II-2: ZigBee PHY layer packet structure, adapted from [167].

Different ZigBee device formats exist and the SHR varies in length and duration
for different ZigBee PHY options, i.e. frequency (868 MHz to 2.4 GHz) and shift keying
approach [168]. ZigBee devices can employ amplitude shift keying (ASK), binary phase
shift keying (BPSK), or quadrature phase shift keying (QPSK), as seen in Table 3.4 of
[168]. However, while the format of each region changes per keying method, the use of
each region is consistent across ZigBee devices: the preamble is used for synchronization
between devices, and the Start-of-Frame Delimiter (SFD) region is used to indicate the
end of the SHR and the start of the PHR [168].


Of specific interest herein are Texas Instruments CC2420 2.4 GHz ZigBee devices,
which employ QPSK [91]. These devices have a defined 128 µs duration preamble of 4
octets (4 bytes), each containing 8 zero bits, and a 1 octet (1 byte) defined SFD
containing 2 hexadecimal symbols [168]. The ZigBee SHR region format is presented in
Table II-1. Four synchronization words (SWs) are defined as the last octet of the
preamble and the SFD [167]; alternately, Farahani [168] lists a possible SFD value of E5,
or 11100101. The PHR region contains frame length information; it is 1 byte in
length, and the frame length ranges from 0 to 127 [168].
Table II-1: ZigBee SHR Region Format, adapted from [91, 167, 168].

SHR Region         Preamble                                                                SFD
-----------------  ----------------------------------------------------------------------  ----------
Hexadecimal value  0      0      0      0      0      0      0      0                      7     A
Binary             0000   0000   0000   0000   0000   0000   0000   0000                   0111  1010
CC2420             Zeros  Zeros  Zeros  Zeros  Zeros  Zeros  SW0    SW1                    SW2   SW3
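
As a quick consistency check on the SHR description, the following sketch lists the ten SHR symbols and computes the preamble duration. The constant and function names are hypothetical; the 250 kbit/s rate is the ZigBee bit rate listed in Table II-3, and the symbol order follows Table II-1. This is a worked illustration, not CC2420 driver code.

```python
PREAMBLE_OCTETS = 4          # preamble length in octets, per [168]
SFD_SYMBOLS = (0x7, 0xA)     # SFD hexadecimal symbols as listed in Table II-1
BIT_RATE_BPS = 250_000       # ZigBee 2.4 GHz bit rate from Table II-3


def shr_symbols():
    """Return the SHR as 4-bit symbols: eight zero symbols, then the SFD."""
    return [0x0] * (2 * PREAMBLE_OCTETS) + list(SFD_SYMBOLS)


def preamble_duration_us():
    """Preamble duration in microseconds at the assumed bit rate."""
    return PREAMBLE_OCTETS * 8 * 1_000_000 / BIT_RATE_BPS


def phr_octet(frame_length):
    """Return the PHR value for a frame length, which must lie in 0..127."""
    if not 0 <= frame_length <= 127:
        raise ValueError("frame length must be between 0 and 127")
    return frame_length


print([format(s, "04b") for s in shr_symbols()])
print(preamble_duration_us())
```

Thirty-two preamble bits at 250 kbit/s give 128 microseconds, in agreement with the preamble duration stated in the text.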

2.2.1.1 ZigBee Data Collection Experiment

The ZigBee dataset under analysis is a four-class authorized device classification
model development problem with six additional rogue devices for verification [91].
Signals from the ZigBee devices were collected in three different environments: ‘CAGE,’
signals in a Ramsey STE3000B RF shielded anechoic chamber; ‘LOS,’ line of sight
signals in an office hallway, denoted by A in Figure II-3; and ‘WALL,’ signals collected
behind a wall, denoted by B in Figure II-3 [91].


Figure  II-3:  Conceptualization  of  ZigBee  data  collection,  from  [91]. 
Table II-2 describes the data collection experiment and the data available for each
ZigBee device. The four devices used for model development (Dev1 - Dev4) had data
collected in all three environments [91]. However, data from the rogue devices (Dev5 -
Dev10) was only collected in one or two environments [91]. To keep the number of
observations per rogue device consistent, the WALL collections of
devices 5-7 are treated as additional devices [91].

Table  II-2:  ZigBee  Collected  Data,  adapted  from  [91]. 


ZigBee burst signal data was collected by Dubendorfer [91] using an Agilent
receiver to collect burst transmissions from the ten Texas Instruments CC2420 2.4 GHz
ZigBee devices. The ZigBee devices were set up to transmit at 2.4 GHz, within the
Agilent receiver’s 20.0 MHz to 6.0 GHz range and 36.0 MHz bandwidth [91]. For each
device, 1000 burst responses of the SHR and PHR regions were collected under three
different operating conditions [91].




2.2.2  IEEE  802.15.4  Z-Wave  Devices 


While the ZigBee device dataset is representative of many applications, it only
considers one type of device. Therefore, consistent with [49], Z-Wave devices are
considered in addition to the ZigBee devices as an extension to this research. Both ZigBee and
Z-Wave devices are small, low-cost wireless communications devices; however,
differences exist between ZigBee and Z-Wave, primarily in standards and security [169].

While ZigBee devices employ an IEEE standard for industrial, residential and
sensor monitoring and automation, Z-Wave devices employ a proprietary standard
developed by ZenSys, primarily for residential automation [170-172]. While ZigBee
and Z-Wave are similar in concept and possible use, differences exist in security,
operating frequency, data rate, and latency, as seen in Table II-3. Primarily, Z-Wave is
considered less secure than ZigBee due to Z-Wave originally lacking built-in encryption
[170]. Additionally, the Z-Wave standard is proprietary and not publicly available,
unlike ZigBee [172].


Table II-3: ZigBee versus Z-Wave, adapted from [170, 172].

                      Z-Wave                              ZigBee
--------------------  ----------------------------------  --------------------------------
Frequency             906 MHz                             2.4 GHz
Bit Rate              40 Kbits/s                          250 Kbits/s
Security              None (200 and 300 series models);   IEEE 802.15.4 security standards
                      AES 128 (400 series models)
Latency               ~1000 ms                            50-100 ms
Range                 30-100 m                            10-100 m
Message Size (bytes)  64 (max)                            127 (max)
Z-Wave follows a similar ISO architecture to ZigBee, and similarly has a
predefined preamble and start of frame (SoF) [173]. A conceptualization of the Z-Wave PHY packet
structure is presented in Figure II-4; for RF-DNA, the preamble is again considered as the
ROI in the signal. Z-Wave also includes a payload-based home identification (32 bits)
and source identification (8 bits) [172].


[Figure: MAC and Transport Sublayer (Home ID, Header, Payload); PHY Layer (Preamble, Payload).]

Figure II-4: Z-Wave PHY layer packet structure, adapted from [173].

For purposes herein, three Aeotec Z-Stick S2 transmitters, consistent with [174],
were employed as described by [49, 123]. A total of 230 Z-Wave bursts were collected at
2 Msps, with the preambles being the first 8.3 ms of each burst. Z-Wave data was
collected under LOS conditions with the Z-Wave devices placed 10 cm from a
vertically-oriented LP0410 log-periodic antenna, which was connected via a Gigabit Ethernet cable
to a USRP-2921 RF input [49]. Amplitude-based leading edge detection was employed
with a -6 dB detection threshold to detect and extract the bursts from the background
noise [49]. The collected signal had a Signal-to-Noise Ratio of SNR = 24 dB and was
like-filtered [49].

2.2.3 Post-Collection Data Manipulation

After collecting the ZigBee RF emissions, Dubendorfer [91] converted the files to
MATLAB format. Since the SHR and PHR regions begin each ZigBee transmission and
do not change between devices, the RF-DNA process was applied to this region of the
ZigBee transmission [91]. First, Dubendorfer [91] detected the bursts from the ZigBee
devices, which comprise the signals of interest. After digital filtering through a
Butterworth baseband filter, additive white Gaussian noise (AWGN) was added to
create a range of 16 operating points between SNR = 0 dB and SNR = 30 dB using five
independent noise realizations per device [91]. A similar approach was considered for
the Z-Wave devices, where AWGN was added to collected signals to achieve desired
operating points of SNR ∈ [0, 24] dB in 2 dB steps [49].
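
As an illustration of how such operating points can be generated, the following is a minimal Python sketch. It is not the MATLAB processing used in [49] or [91]. It assumes the collected burst can be treated as the signal component when scaling the noise, and the function name `add_awgn` and the placeholder tone are hypothetical.

```python
import numpy as np

def add_awgn(signal, target_snr_db, rng):
    """Return (noisy_signal, noise) with complex AWGN scaled to target_snr_db."""
    signal_power = np.mean(np.abs(signal) ** 2)
    noise_power = signal_power / (10.0 ** (target_snr_db / 10.0))
    # Complex noise: split the noise power equally between the I and Q parts.
    noise = np.sqrt(noise_power / 2.0) * (
        rng.standard_normal(signal.shape) + 1j * rng.standard_normal(signal.shape)
    )
    return signal + noise, noise

rng = np.random.default_rng(0)
n = np.arange(20000)
burst = np.exp(2j * np.pi * 0.01 * n)      # placeholder unit-power burst (assumption)

snr_grid_db = np.arange(0, 25, 2)          # 0, 2, ..., 24 dB, as for the Z-Wave data
noisy, noise = add_awgn(burst, 10.0, rng)

# Check the achieved SNR for the 10 dB operating point.
measured_snr_db = 10.0 * np.log10(
    np.mean(np.abs(burst) ** 2) / np.mean(np.abs(noise) ** 2)
)
```

Repeating the call with independent random draws gives the independent noise realizations per operating point described above.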

2.3  Physical  Layer  Device  Identification 

Because  PHY  layer  characteristics  are  associated  with  the  physical  properties  of 
devices,  they  are  naturally  harder  to  spoof  than  characteristics  associated  with  other  OSI 
levels [175]. PHY layer security consists of two broad approaches for exploiting RF
emission features: 1) adding a physical object to an electronic device, such as an RF
Certificate of Authenticity (RF-COA), or 2) exploiting inherent emission features of electronic
devices,  such  as  RF-DNA.  A  brief  review  of  the  various  approaches  is  considered  to 
illustrate  the  benefits  of  the  RF-DNA  approach. 

2.3.1 RF Device Emissions


Both  intended  and  unintended  emissions  occur  across  the  electromagnetic 
spectrum  in  a  variety  of  forms;  intentional  emissions  can  range  from  light  emitted  from  a 
light  bulb  to  wireless  communications.  Unintended  emissions  are  also  emitted  from  a 
variety  of  sources;  one  commonly  experienced  form  of  unintended  electromagnetic 
emissions  occurs  through  light  pollution  which  makes  viewing  the  night  sky  difficult  in 
urban areas [176]. Since the 1970s, man-made noise from unintended emissions has
increased due to the proliferation of electronic devices [177]. Electronic device
emissions have security [178], safety [179], and interference and communications [180]
ramifications.  Although  shielding  and  design  are  used  to  reduce  unintended  emissions, 
the  underlying  physics  of  electronic  devices  precludes  their  elimination  [180,  181]. 

RF emissions can emanate from both intended and unintended radiators [182];
unintended RF emissions emanate from normal operations and are caused by transistor
switching, current flow, and integrated circuit (IC) activity, in addition to other
electromagnetic effects [19, 183]. Although unintended RF emissions are generally
considered a source of interference, they are also useful for discriminating between
disparate devices [184]. When devices from the same production run are considered,
production-induced variations result in devices being within production tolerances yet
having different RF emissions [19]. Although exploiting intentional device emissions is
of concern herein, exploring methods used to exploit both unintended and intended
emissions adds important background knowledge for this research.

Four leading RF-based device identification methods have been proposed: Radio
Frequency Identification (RFID), Physical Unclonable Functions (PUFs), RF Certificates
of Authenticity (RF-COA), and RF Fingerprinting. Of these, only RF Fingerprinting
exploits signals that inherently emanate from the device, while the other three methods
require the addition of components to the underlying devices.

2.3.1.1 Radio Frequency Identification (RFID)

RFID is a tracking technology seen in some RF physical layer security schemes.
RFID involves placing a ‘tag’ on a device for tracking; each tag is an identifier antenna
circuit based on RF communication between the antenna and a scanner [185, 186]. RFID
antennae can be either powered and actively emitting or unpowered and emitting only
when scanned [71]. RFID has been applied in many commercial and warehouse
settings where products and parts are tracked [186]. RFID does have known issues,
including interference [187] and the practical requirement that an RFID antenna be
knowingly placed (visibly or otherwise) on an object in order for it to be scanned.

2.3.1.2 Physical Unclonable Functions (PUFs)

PUFs offer two techniques for authentication: 1) augmenting an IC with internal
measurement circuitry, and 2) adding a grid of capacitive sensors onto the top IC
layer [19]. Both of these PUF approaches require physical IC manipulation and are
therefore impractical to pursue, since legacy ICs are already in operational use.

2.3.1.3 RF Certificates of Authenticity (RF-COA)

RF-COAs are another attempt to add identifying characteristics to electronic
devices. RF-COAs extend the RFID concept by placing small, unique, three-dimensional
antennae composed of randomly shaped conductors and dielectric components (COAs)
onto electronic devices to create a uniquely identifiable RF signal [73]. The philosophy of
this approach is that unique COAs would be issued by manufacturers of objects
and software to confirm their provenance [73]. In essence, RF-COAs are a combination
of PUFs and RFID, where the RF-COAs are read by an external RFID type of reader
[19]. The obvious impediments are the emplacement of RF-COAs on devices already in
operation, the additional cost of extra components, and additional considerations in the
design and fabrication process. The ease of spoofing is also a known issue with the COA
approach [73].

2.3.1.4  RF  Fingerprinting 

RF Fingerprinting refers to one of two processes: characterizing the RF
environment devices operate in, cf. [188, 189], or identifying devices based on
differences in transmitted signals resulting from differing characteristics, due to
production and life style variations, among various devices [79]. Of interest herein is the
AFIT RF-DNA RF Fingerprinting process, which is unique in RF Fingerprinting in that it
applies statistical methods of feature extraction and classification to the RF
Fingerprinting process [133]. RF-DNA has been explored for both inter-device
variations, e.g. differentiating similar devices from different manufacturers [190], and
intra-device variations, e.g. differentiating devices at the serial number level [91, 190].
In operation, the AFIT RF-DNA process consists of two parts, the signal collection
aspect (which involves various signal collection equipment) and the processing aspect
(which occurs within MATLAB) [190], as shown in Figure II-5.

[Figure II-5 panel labels: Fingerprint; Instantaneous Signal; Statistical Feature. (N_R + 1) Regions x 3 Characteristics x 3 Moments = 729 Total Features]

Figure II-5: RF-DNA Fingerprinting Architecture, adapted from Cobb et al. [19].

After collection, the data is digitally filtered and manipulated to create samples at
various SNR levels. Following this, RF-DNA fingerprints are computed, various
classification schemes are applied for model development, and verification of the models
is explored using rogue devices. RF-DNA involves extracting fingerprints from RF
emissions in a manner akin to biometrics, finding unique attributes of electronic
devices. A visualization of computing RF-DNA fingerprints from sampled-time ZigBee
SHR data is presented in Figure II-6.


Figure II-6: Traditional RF-DNA Feature Extraction Approach as Applied to ZigBee Devices, adapted from [91].

2.4 1D Time Domain (TD) RF-DNA Fingerprints

After dividing the collected and processed data’s ROI into bins, the signal’s
instantaneous amplitude (a), phase (φ), and frequency (f) responses are computed for each
bin [89, 91, 128]. When considering the region of interest (ROI) of the sampled signal as a
complex I-Q signal,

$$ s[n] = s_I[n] + j\,s_Q[n], \tag{2.1} $$

the RF-DNA fingerprint elements can be computed thusly [91]:

$$ a[n] = \sqrt{s_I^2[n] + s_Q^2[n]}, \tag{2.2} $$

$$ \phi[n] = \tan^{-1}\!\left(\frac{s_Q[n]}{s_I[n]}\right), \quad \text{for } s_I[n] \neq 0, \tag{2.3} $$

$$ f[n] = \frac{1}{2\pi}\left(\frac{d\phi[n]}{dt}\right), \tag{2.4} $$

consistent with general formulations found in [64, 191, 192]. Per Dubendorfer [91],
(2.2)-(2.4) are normalized through subtracting the mean and dividing by the maximum,

$$ \bar{g}_c[n] = \frac{g[n] - \mu_g}{\max_n\left(g_c[n]\right)}, \tag{2.5} $$

where g in (2.5) represents the respective RF-DNA fingerprint elements in (2.2)-(2.4) for
n = 1, 2, ..., N_s, where N_s represents the number of samples in the region,
g_c[n] = g[n] - μ_g is the mean-centered element, and μ_g represents the mean of the
g-th fingerprint element.
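
Equations (2.1)-(2.5) can be sketched in Python as follows. This is an illustrative sketch rather than the processing code of [91]: `np.arctan2` with phase unwrapping is substituted for the plain arctangent in (2.3), the derivative in (2.4) is approximated by a finite difference in cycles per sample, and the function names and test signal are hypothetical.

```python
import numpy as np

def instantaneous_responses(s):
    """Instantaneous amplitude, phase, and frequency of a complex I-Q sequence."""
    s_i, s_q = s.real, s.imag
    a = np.sqrt(s_i ** 2 + s_q ** 2)               # amplitude, as in (2.2)
    # arctan2 also covers s_I[n] = 0; unwrapping removes 2*pi jumps (assumption).
    phi = np.unwrap(np.arctan2(s_q, s_i))          # phase, as in (2.3)
    f = np.gradient(phi) / (2.0 * np.pi)           # frequency in cycles/sample, (2.4)
    return a, phi, f

def center_and_normalize(g):
    """Subtract the mean, then divide by the maximum of the centered response (2.5)."""
    g_c = g - np.mean(g)
    return g_c / np.max(g_c)

# Hypothetical test signal: a tone at 0.05 cycles/sample with a slow amplitude ramp.
n = np.arange(1000)
s = (1.0 + 0.5 * n / n.size) * np.exp(2j * np.pi * 0.05 * n)

a, phi, f = instantaneous_responses(s)
a_norm = center_and_normalize(a)
```

Note that a response with no variation about its mean (for example the constant frequency of this test tone) has a centered maximum near zero, so a practical implementation would need to guard that division.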

RF-DNA fingerprint features are then extracted from the normalized amplitude,
frequency, and phase responses. The considered RF-DNA features are the 2nd, 3rd, and 4th
mathematical moments: variance (σ²), skewness (γ), and kurtosis (κ) [90, 91]. Standard
deviation can also be computed as an RF-DNA fingerprint feature, and was applied by [51];
however, as it is necessarily highly correlated with variance, it was not applied to ZigBee
signals by Dubendorfer [91], and it will not be examined herein.

Considering the 2nd to 4th mathematical moments enables an understanding of
distributional properties within each bin: respectively, the variability about the mean
(variance), asymmetry about the mean (skewness), and distribution curvature (kurtosis)
[193-195]. Mathematical moments have also seen similar applications in other
domains, cf. [196-201]. Skewness values are centered at 0, which indicates no
skewness about the mean; skewness values are then either positive, for a left-sided
distribution (mass concentrated on the left, with a longer right tail), or negative, for a
right-sided distribution [202]. Kurtosis values indicate the pointedness or flatness of a
distribution, with κ = 3 termed mesokurtic, κ < 3 termed platykurtic (flatter), and
κ > 3 termed leptokurtic (more pointed) [202]. The RF-DNA features σ², γ, and κ are
computed for N total samples through the following formulas:


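$$ \sigma^2 = \frac{1}{N}\sum_{n=1}^{N}\left(x[n] - \mu\right)^2, \tag{2.6} $$

$$ \gamma = \frac{1}{N\sigma^3}\sum_{n=1}^{N}\left(x[n] - \mu\right)^3, \tag{2.7} $$

$$ \kappa = \frac{1}{N\sigma^4}\sum_{n=1}^{N}\left(x[n] - \mu\right)^4, \tag{2.8} $$

where,

$$ \mu = \frac{1}{N}\sum_{n=1}^{N} x[n], \tag{2.9} $$

and x[n] represents the n-th feature vector element from the amplitude, phase, or frequency
response [91].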
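
The three statistics can be illustrated with a short Python sketch. It assumes the population (1/N) form of the moments, with skewness and kurtosis standardized by powers of σ so that a Gaussian sample gives κ near 3, in line with the mesokurtic value quoted above. The function name `rf_dna_moments` is hypothetical.

```python
import numpy as np

def rf_dna_moments(x):
    """Variance, skewness, and kurtosis of one region, using 1/N normalization."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    var = np.mean((x - mu) ** 2)
    sigma = np.sqrt(var)
    skew = np.mean((x - mu) ** 3) / sigma ** 3
    kurt = np.mean((x - mu) ** 4) / sigma ** 4
    return var, skew, kurt

# Small symmetric example: mean 3, variance 2, no skewness, flat (platykurtic) shape.
var, skew, kurt = rf_dna_moments([1, 2, 3, 4, 5])

# A large Gaussian sample should be close to mesokurtic (kurtosis near 3).
rng = np.random.default_rng(0)
_, skew_gauss, kurt_gauss = rf_dna_moments(rng.standard_normal(200_000))
```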

Combined together, the RF-DNA features are arranged in a vector as

$$ F_{R_i} = \left[\sigma^2_{R_i} \;\; \gamma_{R_i} \;\; \kappa_{R_i}\right]_{1\times 3}, \tag{2.10} $$

for each region i = 1, 2, ..., N_R + 1, where N_R refers to the total number of observed
sequences and the additional (N_R + 1)-th entry refers to statistics computed over the
entire signal characteristic. When considering an entire characteristic’s features, (2.10)
extends to

$$ F^{C} = \begin{bmatrix} F_{R_1} \\ F_{R_2} \\ \vdots \\ F_{R_{N_R}} \\ F_{R_{N_R+1}} \end{bmatrix}. \tag{2.11} $$

When considering the amplitude, frequency, and phase fingerprints, (2.10) and (2.11) are
extended through concatenations:

$$ F = \begin{bmatrix} F^{a} \\ F^{\phi} \\ F^{f} \end{bmatrix}. \tag{2.12} $$
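
The assembly in (2.10)-(2.12) can be sketched in Python as below. This is an illustration under stated assumptions, not the original MATLAB implementation: regions are taken as equal-length contiguous bins, the value of 80 regions is chosen only because (80 + 1) regions x 3 characteristics x 3 moments reproduces the 729-feature total shown in Figure II-5, the result is stored as a flat vector, and all function names and the random placeholder responses are hypothetical.

```python
import numpy as np

def moments(x):
    """[variance, skewness, kurtosis] of one region, using 1/N normalization."""
    mu = x.mean()
    var = np.mean((x - mu) ** 2)
    sd = np.sqrt(var)
    return np.array([var,
                     np.mean((x - mu) ** 3) / sd ** 3,
                     np.mean((x - mu) ** 4) / sd ** 4])

def characteristic_features(g, n_regions):
    """Moments of each of the N_R regions plus the whole response, as in (2.11)."""
    regions = np.array_split(g, n_regions)
    rows = [moments(r) for r in regions] + [moments(g)]
    return np.concatenate(rows)

def fingerprint(a, phi, f, n_regions):
    """Concatenate amplitude, phase, and frequency features, as in (2.12)."""
    return np.concatenate(
        [characteristic_features(c, n_regions) for c in (a, phi, f)]
    )

# Placeholder responses standing in for normalized a[n], phi[n], f[n] (assumption).
rng = np.random.default_rng(1)
a, phi, f = rng.standard_normal((3, 8000))
F = fingerprint(a, phi, f, n_regions=80)
```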


2.4.1  ZigBee  and  Z-Wave  RF-DNA  Fingerprinting 

For  all  ZigBee  devices  of  interest,  authorized  or  rogue,  Nf  =  729  total  features 
were  computed  from  the  collected  time  domain  burst  signal  [91].  This  corresponds  to  3 
statistical  features  and  81  bins  (78  separate  regions,  and  3  averaged  regions  for  the  entire 
signal).  For  each  feature,  1000  exemplars  were  computed  each  for  CAGE,  LOS,  and 
WALL [91]. Additionally, data was available for 16 SNR levels, SNR ∈ [0, 30] dB, with
each  having  five  different  noise  realizations. 

For classifier model development, training, and testing, the dataset of authorized
devices is separated into two halves; these were ‘interleaved’, meaning every
odd-indexed point was selected for training and every even-indexed point was selected
for testing. In this form, the training and test sets for ZigBee devices both consisted of
500 CAGE observations, 500 LOS observations, and 500 WALL observations.
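
The interleaved split can be illustrated with array slicing. The sketch below is in Python, where indexing starts at 0, so the "odd-indexed" points of a 1-based convention are rows 0, 2, 4, and so on. The placeholder matrix stands in for the 1000 fingerprints of one environment and is an assumption for illustration.

```python
import numpy as np

n_obs, n_feats = 1000, 729
# Placeholder fingerprints for one environment (e.g. CAGE); values are arbitrary.
X = np.arange(n_obs * n_feats, dtype=float).reshape(n_obs, n_feats)

train = X[0::2]   # 1st, 3rd, 5th, ... observations (odd-indexed, 1-based)
test = X[1::2]    # 2nd, 4th, 6th, ... observations (even-indexed, 1-based)
```

Applying the same slicing to the CAGE, LOS, and WALL sets gives 500 training and 500 testing observations from each.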

In operation, these structures are organized as a four-dimensional data structure
in which NFP represents fingerprint observations; NFeats, features; NNz, noise realizations;
and Nc, classes. For ZigBee data, the structure is of size 3000 × 729 × 5 × 4. As an aid
to interpreting a four-dimensional structure: there are 3000 points associated with
feature 1 of noise realization 1 of device 1, and so on. For the rogue devices, 1000
samples were collected in the respective environment; for data storage and
dimensionality concerns, this is considered as 3000 points with only the first 1000
corresponding to fingerprint data and the remaining 2000 being zeros.

For the Z-Wave devices under consideration, 230 LOS observations were
collected and a total of 189 RF-DNA features were computed, giving NFP = 230,
NFeats = 189, NNz = 2, and Nc = 3; thus, the Z-Wave data structure is of size
230 × 189 × 2 × 3. While the ZigBee dataset is of primary interest herein, the Z-Wave
dataset will permit quick algorithmic development due to its smaller size. Additionally,
the Z-Wave dataset will allow generalization of results to more than one signal of interest.

III. Statistical Pattern Recognition


Can  the  truth  be  learned?  With  this  question  we  shall  begin. 

-Søren Kierkegaard, 1813-1855

The  nature  of  the  physical  world  and  how  objects  are  differentiated  and  created 
has  concerned  man  since  time  immemorial:  e.g.  pre-Socratic  physiologoi  such  as 
Anaxagoras,  Anaximander,  and  Democritus  thought  on  the  origin  and  nature  of 
phenomena  [203,  pp.  14-28;  203,  pp.  249-267;  204,  pp.  82-86;  205,  pp.  350-359]. 
Systematic  methods  of  pattern  recognition  begin  with  Aristotelian  thought,  with 
Aristotelian  metaphysics  concerned  with  the  nature  of  being  [203,  p.  139],  Aristotelian 
category  theory  [206],  and  questions  of  classification  in  Eastern  thought,  e.g.  verse  2  and 
6 of the Tao Te Ching and verse 61 of the Hua Hu Ching [207, 208]. Locke considered
thinking as part sensation and part reflection, extending Descartes’ duality of mind with
the observation that the mind considers either “sensations” or “reflections” [209, 210];
similarly, Hume held that one needs to experience something before one can visualize
that something [211]. In essence, these propositions echo training and testing problems in
pattern  recognition.  Pattern  recognition  is  critical  to  both  every  day  and  computational 
tasks  [212],  and  broadly  covers  classification  of  objects,  clustering,  and  recognizing 
variables  and  patterns  of  variables  [213].  The  term  statistics  has  also  become  associated 
with  data  analysis.  Originally  referring  to  a  science  of  politics  [214],  and  descended  from 
the  Latin  statista,  meaning  “political  state”  [215],  its  meaning  has  shifted  to  become 
synonymous  with  data  analysis  and  distributional  measures  [215]. 

3.1 Introduction


This chapter is organized as follows: Section 3.2 discusses Multiple Discriminant
Analysis (MDA), Section 3.3 discusses the Learning Vector Quantization (LVQ) family
of algorithms, including Generalized Relevance Learning Vector Quantization Improved
(GRLVQI), and Section 3.4 discusses performance assessment methods of interest to
RFINT. Of particular interest herein are statistical methods applied to pattern recognition
tasks, especially those used for supervised clustering or ‘classification’, where patterns are
compared with a set of known classes [213]. This differs from unsupervised
classification, commonly known as ‘clustering’, where known predefined groups do not
exist [213]. Additionally, supervised classification for RF Distinct Native Attribute
(RF-DNA) problems considers two parts: classification and verification [19]. The first
part, classification, involves the classifier model development stage, where the primary
concern is a “one versus many” problem of known group identities with the goal of
creating a classifier model that effectively discriminates between authorized devices [19].
Verification involves vetting the classification model by how well it recognizes authorized
and non-authorized (rogue) devices in a “one versus one” claimed identity problem [19].

Various  classification  methods  exist;  herein  we  are  primarily  concerned  with 
methods  previously  employed  for  RF-DNA  features,  namely  MDA  and  the  GRLVQI 
algorithm.  Both  MDA  and  the  LVQ-family  of  algorithms  are  described  below;  MDA  is  a 
linear  method  whereas  LVQ  methods  are  nonlinear  approaches  that  incorporate  various 
nearest  neighbors,  neural  network  and  nonlinear  concepts. 

Figure III-1 presents a conceptualization of differences in classifier paradigms
between MDA and LVQ approaches, showing MDA maximizing inter-class separation
while minimizing intra-class spread, and LVQ moving prototype vectors toward
same-class exemplars and away from out-of-class exemplars. In describing both MDA and
the various LVQ methods, the following general notation will be used: the input data
matrix is defined as X, which has Ntot total observations (rows) and Np data features
(columns). This will additionally be considered for Nc classes.


[Figure III-1 panel labels: a) MDA Classifier; b) LVQ Classifier (Iteration 0)]

Figure III-1: Conceptualization of a) MDA class projections from [216] and b) LVQ
prototype development as adapted from [51, 216].

3.2  Multiple  Discriminant  Analysis 

MDA extends Fisher’s linear Discriminant Analysis (DA) to multiple classes
[216, pp. 121-124]. DA and MDA are frequently used for predictive/classification and
descriptive/clustering tasks and are applied in domains ranging from ecology [217, 218],
civet coffee authentication [219], behavioral sciences [220], marine data analysis
[221, 222], and muzzle flash identification [223] to MDA/Maximum Likelihood (MDA/ML)
methods for RFINT [88, 92-94, 119, 133, 224]. MDA and DA also frequently compare
favorably (in accuracy, computation time, or both) to more complicated statistical
methods, such as neural networks, logistic regression, support vector machines, naive
Bayes classifiers, and LVQ approaches, cf. [51, 92, 225-229]. Research extensions and
variants of DA and MDA also exist; these include extending MDA or DA to use other
machine learning and statistical tools, such as kernels or nonparametric statistics
[230-234].

MDA is a linear classifier based on Fisher’s two-class method, but extended to
multiple classes [235, 236]. Weight vectors are computed from sample-based estimates
using the Fisher criterion function for maximum discrimination,

$$ \lambda = \frac{b^T S_b b}{b^T S_w b}, \tag{3.1} $$

which is a ratio of the between groups and within groups sums of squares, with b being the
discriminant weights (eigenvectors) of S_w^{-1} S_b, and λ being the associated eigenvalue
that equals the separation [237, 238]. To maximize λ with respect to b, (3.1) can be treated
as a constrained maximization problem, max_b b^T S_b b subject to b^T S_w b = 1, by taking
the partial derivative and setting it equal to zero [239, 240]. Considering the Lagrangian,

$$ L = b^T S_b b - \lambda\left(b^T S_w b - 1\right), \tag{3.2} $$

and taking the partial derivative of (3.2) with respect to b,

$$ \frac{\partial}{\partial b}\left(b^T S_b b - \lambda\left(b^T S_w b - 1\right)\right) = 2 S_b b - 2\lambda S_w b, \tag{3.3} $$

one arrives at a problem similar to eigenvalues/eigenvectors [237, 238]. Setting (3.3)
equal to zero yields,

$$ \left(S_b - \lambda S_w\right) b = 0 \;\Longrightarrow\; \left(S_w^{-1} S_b - \lambda I\right) b = 0, \tag{3.4} $$

a common eigenvalue/eigenvector problem [216]. Taking the partial derivative of (3.2)
with respect to λ and setting it equal to zero gives,

$$ b^T S_w b = 1, \tag{3.5} $$

hence the eigenvector is scaled to unit variance.

The between class sum of squares S_b is defined as

$$ S_b = S_T - S_w, \tag{3.6} $$

with S_w, the within class scatter matrix, built from the per-class scatter matrices

$$ S_{w_i} = \sum_{j=1}^{N_i} \left(x_{ij} - \mu_i\right)\left(x_{ij} - \mu_i\right)^T, \tag{3.7} $$

where x_{ij} is the j-th observation of the i-th group, μ_i is the i-th group mean or
centroid, and N_i is the total number of observations in the i-th group [237, p. 401]. The
within groups sum of squares, assuming the covariance matrices of the classes are equal,
is S_w = S_{w_1} + S_{w_2} + ... + S_{w_{N_c}}; and the total mean corrected sums of
squares and cross products is defined as:

$$ S_T = \sum_{i=1}^{N_c}\sum_{j=1}^{N_i} \left(x_{ij} - \mu_0\right)\left(x_{ij} - \mu_0\right)^T, \tag{3.8} $$

where μ_0 represents the grand mean vector [19, 216]. Data X is then projected to an N_df
dimensional discriminant space according to

(3.9)

where G = [b_1, b_2, ...] and

$$ N_{df} = \min\left(N_c - 1,\, N_p\right), \tag{3.10} $$


which restricts the total number of discriminant functions [237, p. 401]. Although (3.10)
is frequently specified as N_c - 1 [19, 51, 90, 91], such a reduction may not be
appropriate if a small set of features is used or selected. The maximum number of
discriminant functions to generate is determined by the eigenvalues of S_w^{-1} S_b. If the
eigenvalues of S_w^{-1} S_b are distinct, the number of linear composites will be bounded by
the rank of S_b and, consequently, the rank of S_w^{-1} S_b [237, p. 401]. Additionally, when
the number of features exceeds the number of observations, the covariance matrix is
singular, which can violate distributional assumptions and produce complex-valued
discriminant loadings with correspondingly dubious underlying discriminant functions.
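
The derivation in (3.1)-(3.10) can be checked numerically with a small Python sketch. This is a minimal illustration under stated assumptions, not the implementation used in the cited RF-DNA work: data rows are observations, scores are formed as X times the weight matrix, the eigenproblem is solved with NumPy's general eigensolver, and the function name `mda_fit` and the synthetic three-class data are hypothetical.

```python
import numpy as np

def mda_fit(X, y):
    """Discriminant weights from the eigenvectors of S_w^{-1} S_b."""
    classes = np.unique(y)
    n_p = X.shape[1]
    mu0 = X.mean(axis=0)                       # grand mean
    S_t = (X - mu0).T @ (X - mu0)              # total scatter, as in (3.8)
    S_w = np.zeros((n_p, n_p))
    for c in classes:                          # per-class scatter, as in (3.7)
        d = X[y == c] - X[y == c].mean(axis=0)
        S_w += d.T @ d
    S_b = S_t - S_w                            # between-class scatter, (3.6)
    evals, evecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    evals, evecs = evals.real, evecs.real
    n_df = min(len(classes) - 1, n_p)          # number of functions, (3.10)
    order = np.argsort(evals)[::-1][:n_df]
    B = evecs[:, order]
    # Rescale each column so that b^T S_w b = 1, matching (3.5).
    B = B / np.sqrt(np.einsum("ij,jk,ki->i", B.T, S_w, B))
    return B, evals[order], S_w, S_b

# Synthetic data: 3 classes, 4 features, 50 observations per class (assumption).
rng = np.random.default_rng(0)
means = np.array([[0, 0, 0, 0], [4, 0, 1, 0], [0, 4, 0, 1]], dtype=float)
X = np.vstack([rng.standard_normal((50, 4)) + m for m in means])
y = np.repeat(np.arange(3), 50)

B, evals, S_w, S_b = mda_fit(X, y)
scores = X @ B                                 # projection onto N_df = 2 dimensions
```

With three classes and four features, only two discriminant functions are retained, which matches the bound in (3.10).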


3.2.1 MDA Feature Relevance Ranking

Classifier-based  feature  relevance  rankings  from  MDA  are  currently  unexplored 
in  RF-DNA  methods  with  some  research,  e.g.  [51,  91,  92,  113,  134,  241],  even  positing 
that  one  cannot  extract  feature  relevance  rankings  from  MDA.  However,  the  method  of 
discriminant  loadings  is  one  approach  that  directly  computes  the  contributions  of  each 
data  feature  to  the  resultant  discriminant  functions. 

Discriminant  loadings  reflect  the  contribution  of  each  data  feature  to  a  given 
discriminant  function  and  are  analogous  to  principal  component  loadings  [237,  pp.  394- 
429]. Dillon and Goldstein [237] suggest that, because the eigenvectors are unsuitable
for conveying the contribution of each feature to the discriminant functions,
one should instead compute the loadings. It is of interest to examine the ‘contribution’
of each input feature to each discriminant function as a means of screening data features.
Occasionally, these values are reported in literature [242], but they are usually included
only to describe results. Dillon and Goldstein list discriminant loadings as the simple
correlation between discriminant scores and the input data features [237, p. 414], and
explicitly for the j-th discriminant function [237, p. 373]:

$$ L_j = \mathrm{corr}\left(X,\, b_j^T X\right) = \mathrm{corr}(X, X)\, b_j. \tag{3.11} $$

The statement of Dillon & Goldstein [237, p. 414], “...discriminant loadings for a
variable... is the correlation between the function, G from (3.9), and the variable...” and
echoed in [237, pp. 372-373], is interpreted by [243] as:

$$ L_i = \mathrm{corr}(X, G), \tag{3.12} $$

where we are really computing the correlation of X with (3.9). Realizing that

$$ \mathrm{cov}\left(X,\, b^T X\right) = \mathrm{cov}(X, X)\, b, \tag{3.13} $$

then the correlation expression in (3.12) can be rewritten as

$$ \mathrm{corr}\left(X,\, b^T X\right) = D_X^{-1/2}\, \mathrm{cov}(X, X)\, b\, D_{b^T X}^{-1/2}, \tag{3.14} $$

where D_X is a matrix of the diagonal entries of cov(X, X) and D_{b^T X} is a matrix of the
diagonal entries of cov(b^T X, b^T X) = b^T cov(X, X) b [243]. This further expands to

$$ \mathrm{corr}\left(X,\, b^T X\right) = \mathrm{corr}(X, X)\, D_X^{1/2}\, b \left[b^T \mathrm{cov}(X, X)\, b\right]^{-1/2}. \tag{3.15} $$


One could feasibly scale MDA coefficients to ensure equal variance in all
directions; therefore, one area of related interest is how, if at all, MDA loadings are
affected by scaling the projection matrix. Appendix A addresses this issue by
presenting a lemma that proves MDA loadings are not affected by scaling.
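
A short numerical sketch of discriminant loadings, and of their insensitivity to positive rescaling of the weights, is given below in Python. It assumes rows of X are observations, computes loadings as the correlation between each feature and each discriminant score, and uses random placeholder weights rather than fitted MDA weights. The function name and the final ranking rule (largest absolute loading per feature) are hypothetical illustrations and are not the MLF method developed in this work.

```python
import numpy as np

def discriminant_loadings(X, B):
    """Correlation of each feature (column of X) with each score (column of X @ B)."""
    scores = X @ B
    Xc = X - X.mean(axis=0)
    Gc = scores - scores.mean(axis=0)
    cov_xg = Xc.T @ Gc / (X.shape[0] - 1)
    return cov_xg / np.outer(Xc.std(axis=0, ddof=1), Gc.std(axis=0, ddof=1))

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # correlated features
B = rng.standard_normal((5, 2))            # placeholder discriminant weights

L = discriminant_loadings(X, B)
# Rescale each weight vector by a different positive constant.
L_scaled = discriminant_loadings(X, B * np.array([3.0, 0.25]))

# Hypothetical screening rule: order features by their largest absolute loading.
ranking = np.argsort(-np.abs(L).max(axis=1))
```

In this sketch `L` and `L_scaled` agree to numerical precision, since a correlation is unchanged when one variable is multiplied by a positive constant.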


3.2.2  Maximum  Likelihood  (ML)  Device  Classification 

Herein MDA is considered for the RF-DNA classification and model
development process, with Maximum Likelihood (ML) employed to determine decision
boundaries for classification using equal priors and uniform costs [92]. This research
considers identification as a classification problem, where the classifiers are built to
determine a device’s identity from its RF-DNA fingerprints using training/reference
fingerprints and testing fingerprints. This is considered as a one-to-many comparison
[19]. When examining the ML case, classification involves computing the Bayesian
posterior probabilities from the classifier; for N_c classes, a fingerprint F is assigned to
class ω_i if

$$ P(\omega_i \mid F) > P(\omega_j \mid F), \quad \forall\, j \neq i, \tag{3.16} $$

for i ∈ {1, 2, ..., N_c} devices [19]. The conditional probabilities for such problems are
Bayesian in nature:

$$ P(\omega_i \mid F) = \frac{P(F \mid \omega_i)\, P(\omega_i)}{P(F)}, \tag{3.17} $$

where the denominator is constant across ω_i for a given F [19]; with equal priors for all
classes, P(ω_i) = 1/N_c. The likelihood is estimated through a Gaussian distribution:

$$ P(F \mid \omega_i) = \frac{1}{(2\pi)^{n_d/2}\, |\Sigma|^{1/2}} \exp\left(\Gamma_e\right), \tag{3.18} $$

with Γ_e being a form of Mahalanobis distance:

$$ \Gamma_e = -\frac{1}{2}\left(F - \hat{\mu}\right)^T \Sigma^{-1} \left(F - \hat{\mu}\right), \tag{3.19} $$

for the sample mean, μ̂, and inverse covariance, Σ^{-1}, of the data with an implicit
assumption of normality [19].
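
The decision rule in (3.16)-(3.19) can be sketched in Python as follows. The sketch makes several assumptions that go beyond the text: each class has its own sample mean, a single pooled covariance is shared by all classes, and, because the priors are equal and the normalizing constant in (3.18) is then common to every class, the class with the largest exponent is selected directly. Function names and the synthetic two-dimensional data are hypothetical.

```python
import numpy as np

def fit_gaussian_ml(Z, y):
    """Class means and a pooled covariance estimated from training data."""
    classes = np.unique(y)
    means = np.array([Z[y == c].mean(axis=0) for c in classes])
    resid = Z - means[np.searchsorted(classes, y)]
    cov = resid.T @ resid / (len(Z) - len(classes))
    return classes, means, cov

def classify_ml(Z, classes, means, cov):
    """Assign each row of Z to the class with the largest Gaussian exponent."""
    cov_inv = np.linalg.inv(cov)
    diff = Z[:, None, :] - means[None, :, :]           # shape (n, classes, dims)
    gamma = -0.5 * np.einsum("nci,ij,ncj->nc", diff, cov_inv, diff)
    return classes[np.argmax(gamma, axis=1)], gamma

# Synthetic projected fingerprints: 3 devices in a 2-D discriminant space.
rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
y = np.repeat(np.arange(3), 200)
Z = centers[y] + rng.standard_normal((600, 2))

train, test = slice(0, None, 2), slice(1, None, 2)     # interleaved split
classes, means, cov = fit_gaussian_ml(Z[train], y[train])
pred, gamma = classify_ml(Z[test], classes, means, cov)
accuracy = np.mean(pred == y[test])
```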

3.3  Learning  Vector  Quantization  Family  of  Methods 

The improved Generalized Relevance Learning Vector Quantization
(GRLVQI) algorithm of Mendenhall [244-247] is of primary interest herein due to its
previous application to RF-DNA classification and verification in [51, 92, 100]. Beyond
RF-DNA classification and verification, LVQ methods have seen a wide variety of
applications, ranging from image analysis [244-246, 248] to disease detection [249]. To
fully understand GRLVQI, one must necessarily understand the workings and philosophy
of LVQ and the successive extensions, through GRLVQ, that make up the LVQ family of
algorithms.

Epistemologically,  LVQ  methods  are  neural  networks.  Broadly,  there  are  three 
categories  of  neural  network  approaches:  feedforward,  recurrent,  and  self-organizing 
maps,  with  LVQ  methods  included  in  the  last  category  [250].  This  is  conceptualized  in 
the  general  taxonomy  of  Artificial  Neural  Networks  (ANN)  shown  in  Figure  III-2,  where 
ANN types and basic examples of their architectures, including how nodes and layers
connect, are presented. Broadly, LVQ refers to a family of supervised neural learning
approaches that learn input relevance with classification as part of their cost function
[245, 250-254]. The LVQ family of methods includes various extensions and improvements
from vector quantization (VQ) and the LVQ algorithms developed by Kohonen [255-257].

Both VQ and LVQ are considered as neural network functions due to similarities
in the iterative training approach used for VQ and LVQ prototype vectors, which are
analogous to ANN hidden-layer nodes, the use of gradient descent for training, and the
non-linearity of the process [213]. Additionally, LVQ can be seen as a nearest neighbor
approach through the nearest prototype vector (PV) optimization process [258].

Figure III-2: General taxonomy of ANN approaches, adapted from [254] using the ANN
families of [250, p. 368].


While PVs and hidden nodes appear analogous, a few distinctions exist between
LVQ and ANN networks. Primarily, in LVQ, each PV is associated with a specific class,
resulting in LVQ methods being “winner take all” methods where one and only one PV
will win for each exemplar [259-261]. This also means that LVQ does not
employ an output layer [262]. Therefore, LVQ could be considered as an ANN with no
explicit output layer and a winner-take-all hidden/output layer. These differences
between ANNs and LVQ are conceptualized in Figure III-3.


[Figure III-3 panels: a) Feedforward Neural Network with $N_{PV}$ neurons; b) Learning Vector Quantization with $N_{PV}$ PVs]

Figure III-3: Conceptualization of the differences between a) ANNs and b) LVQ
networks, adapted from [250, 262].

For classification, a constraint exists where PVs must implicitly correspond to a
true data class. Logically, this implies that the number of PVs should be $N_{PV} \propto N_C$; hence,
if $N_C = 3$ then $N_{PV}$ must be a multiple of 3. PVs are then initialized with random values
and assigned to the corresponding classes, with PVs indexed $1, \ldots, N_{PV}/N_C$ being
associated with class 1 and so on. In operation, PVs are considered as organized en bloc,
e.g. if $N_{PV} = 3$ for $N_C = 3$ classes, then $\mathbf{w}_1(t)$ represents true class 1, $\mathbf{w}_2(t)$ represents
true class 2, and so on.
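
The en bloc organization described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work; the function name `assign_pv_classes` and the 1-based class labels are assumptions made for the example.

```python
def assign_pv_classes(n_pv, n_classes):
    """Return the class label (1-based) assigned to each PV index, en bloc.

    Assumes n_pv is a multiple of n_classes, so each class receives
    n_pv / n_classes consecutive PVs.
    """
    if n_pv % n_classes != 0:
        raise ValueError("n_pv must be a multiple of n_classes")
    per_class = n_pv // n_classes
    # PVs 0..per_class-1 belong to class 1, the next block to class 2, etc.
    return [n // per_class + 1 for n in range(n_pv)]


labels = assign_pv_classes(6, 3)
```

With six PVs and three classes, the first two PVs are tied to class 1, the next two to class 2, and the last two to class 3.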

Classification of PVs to data exemplars is considered iteratively through a
distance measure, nominally squared Euclidean distance. Figure III-4 conceptualizes
the general process for LVQ variations, using the logic of LVQ2.1. Figure III-4 shows
the closest in-class PV, $\mathbf{w}^J$, and the closest out-of-class PV, $\mathbf{w}^K$, to the $i$th data
exemplar, $\mathbf{x}_i$, based on the respective distances, $d^J$ and $d^K$. Iteratively, PVs, $\mathbf{w}$, are
compared to a given training set exemplar and either a) moved closer to the
corresponding same-class sample (for correctly classified PVs), and/or b) moved further
away from the out-of-class sample (for incorrectly classified PVs). Depending on the
LVQ variant and PV update strategy, a window can be incorporated to further restrict which
PVs are updated.


Figure III-4: LVQ prototype vector update conceptualization; adapted from [249].


3.3.1  Gradient  Descents  and  LVQ 

Gradient descent involves iteratively moving PVs, or nodes, appropriately
towards or away from a given exemplar [216]. Followed appropriately, the resultant PVs
would accurately characterize the data with lower dimensionality [216]. The general
definition of a linear gradient descent appears as

$$\mathbf{w}(t+1) = \mathbf{w}(t) - \epsilon(t)\,\nabla C(\mathbf{w}(t)), \tag{3.20}$$

where $t$ is the training sample iteration number, $\epsilon(t)$ is a learning rate, $\mathbf{w}(t)$ is a given
PV, $C(\mathbf{w}(t))$ is a cost function, and $\nabla$ denotes the gradient [216, 263]. Care must
therefore be taken in specifying the learning rate, initializing the PVs, and selecting the
cost function.
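
As a minimal sketch of the update in (3.20), the snippet below applies repeated gradient descent steps to a toy quadratic cost. The cost function, the `learning_rate` schedule, and its constants are hypothetical choices made only for illustration; they are not settings used in this work.

```python
def gradient_descent_step(w, grad, rate):
    """One step of (3.20): w(t+1) = w(t) - rate * grad, element-wise."""
    return [w_k - rate * g_k for w_k, g_k in zip(w, grad)]


def learning_rate(t, rate0=0.1, decay=0.001):
    """Hypothetical monotonically decreasing learning rate schedule."""
    return rate0 / (1.0 + decay * t)


# Toy cost C(w) = sum(w_k^2), whose gradient is 2w; the minimum is at w = 0.
w = [1.0, -2.0]
for t in range(100):
    grad = [2.0 * w_k for w_k in w]
    w = gradient_descent_step(w, grad, learning_rate(t))
```

After the loop, `w` has moved close to the minimizer at the origin, which illustrates why the learning rate and the cost function jointly determine how quickly the PVs settle.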

All LVQ methods follow a similar gradient descent based approach, as presented
in (3.20), to move PVs towards or away from data as needed. LVQ methods typically
differ only with respect to the cost function, update logic, and the inclusion of additional
computational steps (e.g. relevance computations). Major variations are reflected
through the addition of letters to the LVQ acronym; a brief taxonomy of major LVQ
variations leading from LVQ to GRLVQI is presented in Table III-1. Kohonen first
extended LVQ by creating variants (cf. LVQ2 and LVQ2.1) that improved the PV update
strategy to updates involving both in-class and nearest out-of-class PVs [255]. Relevance
LVQ (RLVQ) extends LVQ by incorporating a relevance weight for each data feature,
which is learned during the training process [264]. GLVQ extends LVQ by improving
class boundary approximations through the incorporation of a sigmoid cost function and
the use of gradient (first derivative) descent [265]. Hammer and Villmann’s [266]
GRLVQ combined the innovations of both GLVQ and RLVQ to create a GLVQ
algorithm that learns the input dimension weights to provide relevance information
regarding each feature. GRLVQ was then further extended through improvements
resulting in the GRLVQI algorithm [244, 245]. Other variations that divert from
LVQ in PV update approach, logic rules, algorithm formulations, and other methods are
not considered herein. Such innovations include LVQ4 [267], kernel LVQ variants
[268, 269], and information theory based approaches [270]. Further extensions and
philosophies of LVQ variations are documented in reviews, such as those provided by Nova
and Estevez [252], Kaski et al. [271], and Kaden et al. [258].


Table III-1: Major Variations in LVQ Family of Algorithms.

| Version | Variation | Reference |
|---|---|---|
| VQ | An unsupervised clustering ANN/gradient descent approach where PVs are moved towards data exemplars to create a feature space. | [255, 257, 272] |
| LVQ | A supervised clustering (classification) version of VQ which either pushes correctly classified PVs towards a given group or pushes incorrectly classified PVs away. Includes Kohonen variants, in addition to LVQ2, LVQ2.1, and LVQ3. | [256, 257] |
| GLVQ | A generalized form of LVQ; reference vectors are updated with a sigmoid used in the cost function/gradient descent. | [265, 273] |
| RLVQ | LVQ modified with a gradient descent based input feature relevance computation. | [264] |
| GRLVQ | A combination of the innovations in RLVQ and GLVQ. Incorporates 2 gradient descent operations. Weighting factors for inputs are incorporated into the GLVQ method, permitting scaling of input dimensions by relevance. | [266] |
| GRLVQI | GRLVQ with the following improvements: improved prototype update rule, improved prototype utilization, and a frequency based maximum input update strategy. | [245-246] |

3.3.1.1 Vector Quantization (VQ)

VQ and the Self Organizing Feature Map (SOFM) clustering method are
approaches that aim to represent the input data, $\mathbf{X}$, as $N_{PV}$ total PVs [216, 274]. VQ
operates by iteratively selecting a random data exemplar and then using a gradient
descent operation to move the nearest PV towards the given exemplar [255, 272]. In
operation, $N_{PV}$ must first be selected and these PVs must then be initialized appropriately
[255]. Similar to other clustering problems, deciding on the number of PVs ($N_{PV}$) to
create is non-trivial [275-278]. Some care must also be taken in
initializing PVs for VQ. Logically, $N_{PV}/N_C > 1$ is of interest, and PVs initialized with
identical values will yield dubious results; therefore PVs initialized as all zeros are a poor
choice, and initializing with random values is seen in practice [255]. It is also
helpful if the PVs and the data have the same dynamic range; one reasonable
solution is therefore to standardize the data, $\mathbf{X}$, and then draw PVs from a random normal
distribution [255].

After initializing the PVs, the distances between a given $i$th exemplar and each of
the $n = 1, \ldots, N_{PV}$ PVs are computed to find the index of the PV associated with the
minimum distance [255]. Nominally, squared Euclidean distances are used for the
distance measure in VQ, with the cost function being the distance measure itself,

$$d_n = C(\mathbf{w}_n(t)) = (\mathbf{x}_i - \mathbf{w}_n)^2. \tag{3.21}$$

The PV associated with the minimum distance, $\mathbf{w}_d(t)$, is then updated through the
gradient descent process in (3.20). The chain rule is described in Edwards and Penney
[279] as

$$\frac{du(g)}{dv} = \frac{du}{dg}\,\frac{dg}{dv}, \tag{3.22}$$

where $u(g)$ is a function, $u$, of another function, $g$. Considering (3.22) with $u = (\mathbf{x}_i - \mathbf{w}_n)^2$
and $g = (\mathbf{x}_i - \mathbf{w}_n)$, one can compute the derivative for the squared Euclidean
cost function. Following this formulation, the gradient of the cost function is computed as

$$\nabla C(\mathbf{w}_d(t)) = -2(\mathbf{x}_i - \mathbf{w}_d(t)), \tag{3.23}$$

and is then used to update a given PV [255]. The scalar multiplier can be combined with
the learning rate, and the VQ gradient descent operation is thus computed as

$$\mathbf{w}_d(t+1) = \mathbf{w}_d(t) + \epsilon(t)(\mathbf{x}_i - \mathbf{w}_d(t)), \tag{3.24}$$

which flips the sign of (3.20) due to the negation seen in the gradient.
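
A single VQ iteration, as described by (3.21) and (3.24), can be sketched as below. The function names `squared_euclidean` and `vq_step`, the list-of-lists data layout, and the example values are assumptions made for illustration only.

```python
def squared_euclidean(x, w):
    """Squared Euclidean distance between exemplar x and PV w, as in (3.21)."""
    return sum((x_k - w_k) ** 2 for x_k, w_k in zip(x, w))


def vq_step(x, pvs, rate):
    """Move the PV nearest to exemplar x towards x, per (3.24).

    pvs is modified in place; the index of the winning PV is returned.
    """
    d = min(range(len(pvs)), key=lambda n: squared_euclidean(x, pvs[n]))
    pvs[d] = [w_k + rate * (x_k - w_k) for x_k, w_k in zip(x, pvs[d])]
    return d


pvs = [[0.0, 0.0], [4.0, 4.0]]
winner = vq_step([1.0, 1.0], pvs, rate=0.5)
```

Only the winning PV moves; with a rate of 0.5 it travels halfway to the exemplar while the other PV is left untouched.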


3.3.1.2 Learning Vector Quantization (LVQ)

LVQ extends the concepts of VQ by creating essentially a supervised
version of VQ to enable classification [253, 255, 257, 280]. Similar to VQ, $N_{PV}$ PVs are
defined and initialized appropriately, with preference towards the PVs and the data
sharing a similar dynamic range [255]. Thus, instantiating random normal PVs and
standardizing the input data is one common approach.

In operation, LVQ begins similar to VQ, where the distances between a given $i$th
exemplar and each PV are again computed per (3.21) [253, 255, 257, 280]. However, the
gradient descent operation now depends on whether a correct classification was made.
Here, when $\mathbf{w}_d(t)$ is associated with the corresponding class of $\mathbf{x}_i$, a correct
classification was made. The gradient descent process of (3.20) for the $i$th exemplar
follows a Hebbian learning process [281],

$$\mathbf{w}_d(t+1) = \begin{cases} \mathbf{w}_d(t) + \epsilon(t)(\mathbf{x}_i - \mathbf{w}_d(t)) & \text{if } C_d = C_i \\ \mathbf{w}_d(t) - \epsilon(t)(\mathbf{x}_i - \mathbf{w}_d(t)) & \text{if } C_d \neq C_i \end{cases} \tag{3.25}$$

where conditions for correctly and incorrectly classified PVs are both considered, with $C_i$
being the class identity of the $i$th exemplar and $C_d$ being the class identity of the PV under
consideration [255]. In (3.25), $C_d = C_i$ indicates a correctly classified exemplar and
$C_d \neq C_i$ indicates an incorrectly classified exemplar [253, 255, 257, 280].
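
The two-branch update of (3.25) can be sketched as follows. This is an illustrative sketch only; the function names, the string class labels, and the example values are assumptions and are not taken from the cited works.

```python
def squared_euclidean(x, w):
    """Squared Euclidean distance between exemplar x and PV w."""
    return sum((x_k - w_k) ** 2 for x_k, w_k in zip(x, w))


def lvq1_step(x, x_class, pvs, pv_classes, rate):
    """Apply (3.25) to the nearest PV; returns (winner index, correct flag).

    The winner moves towards x when its class matches x_class and away
    from x otherwise. pvs is modified in place.
    """
    d = min(range(len(pvs)), key=lambda n: squared_euclidean(x, pvs[n]))
    correct = pv_classes[d] == x_class
    sign = 1.0 if correct else -1.0
    pvs[d] = [w_k + sign * rate * (x_k - w_k) for x_k, w_k in zip(x, pvs[d])]
    return d, correct


pvs = [[0.0], [2.0]]
result = lvq1_step([0.5], "b", pvs, ["a", "b"], rate=0.5)
```

Here the nearest PV belongs to class "a" while the exemplar belongs to class "b", so the winner is pushed away from the exemplar.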


3.3.1.3 Learning Vector Quantization Improvements (LVQ2 and LVQ2.1)

Three general philosophies exist on improving LVQ: 1) altering the
update logic of (3.25), 2) incorporating additional gradient descents, and 3) changing the
cost function. Kohonen [282] first proposed LVQ2 as an extension of LVQ logic that
only updates PVs when they are appropriately close to a given exemplar. In LVQ2
[282], a window and various criteria are introduced. LVQ2 and LVQ2.1 are
conceptualized via Figure III-5. LVQ2 extends the PV update logic in (3.25) by considering
the two closest PVs to a given exemplar $\mathbf{x}_i$. PVs are updated if and only if (iff)
1) $\mathbf{x}_i$ falls within the window, 2) $\mathbf{x}_i$ belongs to the class of one of the two PVs, and hence
3) the two nearest PVs are an in-class PV and an out-of-class PV. In this process $\mathbf{x}_i$ lies
within the window if

$$\min\left(\frac{d^K}{d^J}, \frac{d^J}{d^K}\right) > s, \tag{3.26}$$

where $s$ is a scale factor having a recommended value of approximately 0.35 [282].


Figure III-5: Conceptualization of the LVQ2 and LVQ2.1 prototype vector update
approach using the LVQ2.1 process; adapted from [282].

Kohonen [282] acknowledged that LVQ2 had various issues, e.g. computational
intensity and slow convergence, and therefore proposed a further variation in LVQ2.1.
LVQ2.1 considers the basic LVQ algorithm with the LVQ2 logic; the difference
is that LVQ2.1 does not wait for the class of $\mathbf{x}_i$ to serendipitously match one of the two
nearest PVs and instead finds both the nearest in-class PV and the nearest out-of-class PV
to $\mathbf{x}_i$ [282].

LVQ2.1’s PV update logic extends (3.25), where the in-class PV is moved toward
the data exemplar,

$$\mathbf{w}^J(t+1) = \mathbf{w}^J(t) + \epsilon(t)(\mathbf{x}_i - \mathbf{w}^J(t)), \tag{3.27}$$

and the out-of-class PV is moved away from the data exemplar,

$$\mathbf{w}^K(t+1) = \mathbf{w}^K(t) - \epsilon(t)(\mathbf{x}_i - \mathbf{w}^K(t)), \tag{3.28}$$

if $\mathbf{x}_i$ falls within the update window [282]. In many subsequent LVQ implementations,
e.g. GLVQ and GRLVQ, the general logic of LVQ2.1 is followed for updating prototype
vectors. Additionally, one of the primary improvements seen in GRLVQI is an extension
of the LVQ2.1 logic.
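
The LVQ2.1 logic described above can be sketched as below. The sketch assumes the commonly stated form of the window test, in which the smaller of the two distance ratios must exceed a scale factor (called `s` here, defaulting to the value of approximately 0.35 noted in the text); the function names and data layout are likewise assumptions for illustration and not the implementation of [282].

```python
def squared_euclidean(x, w):
    """Squared Euclidean distance between exemplar x and PV w."""
    return sum((x_k - w_k) ** 2 for x_k, w_k in zip(x, w))


def lvq21_step(x, x_class, pvs, pv_classes, rate, s=0.35):
    """Update the nearest in-class and out-of-class PVs if x is in the window.

    Returns True when an update was made. pvs is modified in place.
    """
    in_idx = [n for n, c in enumerate(pv_classes) if c == x_class]
    out_idx = [n for n, c in enumerate(pv_classes) if c != x_class]
    j = min(in_idx, key=lambda n: squared_euclidean(x, pvs[n]))
    k = min(out_idx, key=lambda n: squared_euclidean(x, pvs[n]))
    d_j = squared_euclidean(x, pvs[j])
    d_k = squared_euclidean(x, pvs[k])
    # Assumed window test: min(d_j/d_k, d_k/d_j) > s; zero distances fail it.
    if d_j == 0.0 or d_k == 0.0 or min(d_j / d_k, d_k / d_j) <= s:
        return False
    pvs[j] = [w + rate * (x_q - w) for x_q, w in zip(x, pvs[j])]  # toward x
    pvs[k] = [w - rate * (x_q - w) for x_q, w in zip(x, pvs[k])]  # away from x
    return True


pvs = [[0.0], [2.0]]
updated = lvq21_step([0.9], "a", pvs, ["a", "b"], rate=0.1)
```

An exemplar near the midpoint between the two PVs falls inside the window and triggers both updates, whereas an exemplar sitting almost on top of its in-class PV does not.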


3.3.1.4 Relevance Learning Vector Quantization (RLVQ)

RLVQ was introduced by Bojer et al. [264] as an extension of LVQ that
determines feature relevance during the classification process. Bojer et al. [264]
recommend initializing the feature relevance weights $\boldsymbol{\psi}$ as a vector of length $N_F$ with all
values initially equal to $1/N_F$.

Per Hammer and Villmann [266], the RLVQ relevance update expression
introduced by Bojer et al. [264] can be computed for each $q$th data feature as a gradient
descent,

$$\psi(t+1) = \psi(t) - \xi(t)\,\nabla C(\psi), \tag{3.29}$$

where $\psi$ are scalar relevance values associated with a given data feature, and $\xi(t)$ is the
relevance learning rate [264]. The distance from (3.21) for updating relevance rankings
is considered, per [266], as

$$d_n = C(\psi) = \psi \cdot (\mathbf{x}_i - \mathbf{w}_n)^2. \tag{3.30}$$

The relevances are thus updated for the $q$th data feature via

$$\psi_q = \begin{cases} \psi_q - \xi(t)\,(x_{iq} - w_{nq}(t))^2 & \text{if } C_d = C_i \\ \psi_q + \xi(t)\,(x_{iq} - w_{nq}(t))^2 & \text{if } C_d \neq C_i \end{cases} \tag{3.31}$$

with in-class and out-of-class considerations consistent with LVQ and (3.25). Per
Hammer and Villmann [266], the RLVQ expression in (3.31) was formulated per the
gradient descent. This formulation indicates that when the cost function changes, the
$\psi$ update must necessarily change as well. The gradient descent operation and derivation for
PV updates do not change due to the inclusion of the scalar weighting term.
Otherwise, the LVQ operation and logic of (3.25) do not change.
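
The relevance update of (3.31) and the relevance-weighted distance of (3.30) can be sketched as follows. The function names and the example values are assumptions for illustration; any clipping or rescaling of the relevances that a full implementation might apply is deliberately omitted here.

```python
def weighted_distance(psi, x, w):
    """Relevance-weighted squared Euclidean distance, as in (3.30)."""
    return sum(p * (x_q - w_q) ** 2 for p, x_q, w_q in zip(psi, x, w))


def rlvq_relevance_step(psi, x, w, correct, rel_rate):
    """Per-feature relevance update of (3.31) for the winning PV w.

    Relevances decrease by rel_rate * (x_q - w_q)^2 for a correct
    classification and increase by the same amount otherwise.
    """
    sign = -1.0 if correct else 1.0
    return [p + sign * rel_rate * (x_q - w_q) ** 2
            for p, x_q, w_q in zip(psi, x, w)]


n_features = 2
psi = [1.0 / n_features] * n_features  # initial weights of 1/N_F
psi = rlvq_relevance_step(psi, [1.0, 0.0], [0.0, 0.0], True, rel_rate=0.1)
```

For a correctly classified exemplar, the feature on which the exemplar and the winning PV disagree loses relevance, while the feature on which they agree is unchanged.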


3.3.1.5 Generalized Learning Vector Quantization (GLVQ)

GLVQ extends LVQ by considering a sigmoidal cost function for the
gradient descent in (3.20) rather than the linear cost function that produced the generic
VQ gradient descent formulation of (3.24) [265]. The cost function considered in GLVQ
algorithms is

$$C = \sum_{m=1}^{N_{Samples}} f(\mu(\mathbf{x}_m)), \tag{3.32}$$

at iteration $t$ for $N_{Samples}$ samples [245, 265]. The function $f(\mu(\mathbf{x}_m))$ in (3.32) is a
sigmoid function defined as

$$f(\mu(\mathbf{x}_m)) = \frac{1}{1 + e^{-\mu(\mathbf{x}_m)}}, \tag{3.33}$$

of the relative distance difference measure $\mu(\mathbf{x}_m)$ [262].

In GLVQ, GRLVQ and GRLVQI, the relative distance difference measure is
typically defined as

$$\mu(\mathbf{x}_m) = \frac{d^J - d^K}{d^J + d^K}, \tag{3.34}$$

which appears related to the Sørensen and Canberra distance metrics, cf. [283, 284], with $d^J$
and $d^K$ being the respective squared Euclidean distances between the input sample $\mathbf{x}_m$
and the best matching in-class prototype vector $\mathbf{w}^J$ and best matching out-of-class
prototype vector $\mathbf{w}^K$ [245, 252, 265, 266]. The classification performance is inherently
incorporated into (3.34) and, in operation, (3.34) is a normalized value between -1 and 1,
which equates to a correct classification when $\mu(\mathbf{x}_m) < 0$, a perfect classification
(distance from in-class PV to exemplar approaches 0 while the distance from the
out-of-class PV to the exemplar is large) when $\mu(\mathbf{x}_m) = -1$, and an incorrect classification when
$\mu(\mathbf{x}_m) > 0$ [245, 265]. Due to the direction of correct and incorrect classification in
(3.34), minimization is desirable to improve classification performance. This computation
is also termed a “difference-over-sum” normalization or “normalized difference” and sees
application in other domains, cf. [285-289]. The general concept also bears similarity to
an alternative LVQ PV update representation of
$\mathbf{w}_n(t+1) = (1 - s(t)\epsilon(t))\mathbf{w}_n(t) + s(t)\epsilon(t)\mathbf{x}_i$, where $s(t)$ takes the value +1 for correct
classifications and -1 for incorrect classifications [280]. Appendix B further examines
the characteristics of (3.34).

One requirement of the distance measures used for $d^J$ and $d^K$ is that they must be
differentiable for the gradient descent operation [290], since the gradient is a first
derivative. The nominal distance measure used in GLVQ is the same
squared Euclidean distance seen in (3.21); however, the derivation is complicated by
the formulation of (3.32)-(3.34). After computing the derivative associated with the
gradient descent, PVs are updated via

$$\begin{aligned} \mathbf{w}^J(t+1) &= \mathbf{w}^J(t) + \frac{4\,\epsilon(t)\,(\partial f/\partial \mu(\mathbf{x}_m))\,d^K}{(d^J + d^K)^2}\,(\mathbf{x}_m - \mathbf{w}^J), \\ \mathbf{w}^K(t+1) &= \mathbf{w}^K(t) - \frac{4\,\epsilon(t)\,(\partial f/\partial \mu(\mathbf{x}_m))\,d^J}{(d^J + d^K)^2}\,(\mathbf{x}_m - \mathbf{w}^K), \end{aligned} \tag{3.35}$$

which are, respectively, the in-class and out-of-class updates for the winning PVs [245].
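
A single GLVQ update, combining the relative distance difference measure, a logistic sigmoid, and the in-class and out-of-class PV updates, can be sketched as follows. The sketch assumes the standard logistic sigmoid, whose derivative is f(1 - f), and assumes that the two winning PVs have already been identified; the function names and example values are illustrative only.

```python
import math


def squared_euclidean(x, w):
    """Squared Euclidean distance between exemplar x and PV w."""
    return sum((x_k - w_k) ** 2 for x_k, w_k in zip(x, w))


def glvq_step(x, w_j, w_k, rate):
    """Return (mu, new in-class PV, new out-of-class PV) for one exemplar.

    Assumes d_j + d_k > 0 and a logistic sigmoid cost.
    """
    d_j = squared_euclidean(x, w_j)
    d_k = squared_euclidean(x, w_k)
    mu = (d_j - d_k) / (d_j + d_k)      # relative distance difference
    f = 1.0 / (1.0 + math.exp(-mu))     # logistic sigmoid
    df = f * (1.0 - f)                  # derivative of the sigmoid
    denom = (d_j + d_k) ** 2
    g_j = 4.0 * rate * df * d_k / denom
    g_k = 4.0 * rate * df * d_j / denom
    new_w_j = [w + g_j * (x_q - w) for x_q, w in zip(x, w_j)]  # toward x
    new_w_k = [w - g_k * (x_q - w) for x_q, w in zip(x, w_k)]  # away from x
    return mu, new_w_j, new_w_k


mu, new_w_j, new_w_k = glvq_step([0.0], [1.0], [3.0], rate=0.1)
```

Because the in-class step is scaled by the out-of-class distance (and vice versa), the PV that is closer to the exemplar receives the larger relative adjustment.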


3.3.1.6 Generalized Relevance Learning Vector Quantization (GRLVQ)

GRLVQ combines the relevance method of RLVQ with
GLVQ [266]. Therefore, the GLVQ cost function in (3.32) is extended in GRLVQ as

$$C = \sum_{m=1}^{N_{Samples}} f(\mu_{\psi}(\mathbf{x}_m)), \tag{3.36}$$

where $\psi$ is again the relevance [245, 266]. The relevance approach of (3.31) changes to

$$\psi_q = \psi_q - \xi(t)\, f'\big|_{\mu(\mathbf{x}_m)} \left[ \frac{d^K}{(d^J + d^K)^2}\,(x_{mq} - w^J_q)^2 - \frac{d^J}{(d^J + d^K)^2}\,(x_{mq} - w^K_q)^2 \right], \tag{3.37}$$

because GRLVQ employs the cost function and PV updates of GLVQ [266]. Hammer
and Villmann also recommend scaling relevance factors to ensure $\lVert \boldsymbol{\psi} \rVert_1 = 1$ to avoid
instabilities [266]. Consistent with the process of GLVQ, for GRLVQ PVs are updated
via

$$\begin{aligned} \mathbf{w}^J(t+1) &= \mathbf{w}^J(t) + \frac{4\,\epsilon(t)\,(\partial f/\partial \mu(\mathbf{x}_m))\,d^K}{(d^J + d^K)^2}\,\boldsymbol{\psi} \cdot (\mathbf{x}_m - \mathbf{w}^J), \\ \mathbf{w}^K(t+1) &= \mathbf{w}^K(t) - \frac{4\,\epsilon(t)\,(\partial f/\partial \mu(\mathbf{x}_m))\,d^J}{(d^J + d^K)^2}\,\boldsymbol{\psi} \cdot (\mathbf{x}_m - \mathbf{w}^K), \end{aligned} \tag{3.38}$$

which is the formulation in (3.35) with the inclusion of the relevance term [266].
Additionally, some variants of GRLVQ incorporate different learning rates for in-class
and out-of-class updates, as seen in [291].
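
The GRLVQ relevance update and the subsequent rescaling of the relevance factors can be sketched as follows. Several details are assumptions made for the example rather than statements about [266]: a logistic sigmoid is used, negative relevances are clipped to zero before rescaling, and the rescaling makes the relevances sum to one. The function names are hypothetical.

```python
import math


def weighted_distance(psi, x, w):
    """Relevance-weighted squared Euclidean distance."""
    return sum(p * (x_q - w_q) ** 2 for p, x_q, w_q in zip(psi, x, w))


def grlvq_relevance_step(psi, x, w_j, w_k, rel_rate):
    """One relevance update for exemplar x with winning PVs w_j and w_k.

    Assumes d_j + d_k > 0. Returns the rescaled relevance vector.
    """
    d_j = weighted_distance(psi, x, w_j)
    d_k = weighted_distance(psi, x, w_k)
    mu = (d_j - d_k) / (d_j + d_k)
    f = 1.0 / (1.0 + math.exp(-mu))
    df = f * (1.0 - f)
    denom = (d_j + d_k) ** 2
    updated = [
        p - rel_rate * df * (d_k / denom * (x_q - wj_q) ** 2
                             - d_j / denom * (x_q - wk_q) ** 2)
        for p, x_q, wj_q, wk_q in zip(psi, x, w_j, w_k)
    ]
    updated = [max(p, 0.0) for p in updated]   # assumed clipping at zero
    total = sum(updated)
    if total == 0.0:
        return list(psi)                       # keep old values if degenerate
    return [p / total for p in updated]        # assumed sum-to-one scaling


psi = grlvq_relevance_step([0.5, 0.5], [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], 0.1)
```

In this example the first feature separates the exemplar from the out-of-class PV while matching the in-class PV, so its relevance grows at the expense of the second feature.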


3.3.1.7 Improved Generalized Relevance Learning Vector Quantization (GRLVQI)

Mendenhall [244] noted various issues in GRLVQ, including divergence due to
unconditional updating of winning out-of-class prototype vectors. Mendenhall [244], and
Mendenhall and Merenyi [245, 246], developed the GRLVQI algorithm to rectify these
issues by improving the GRLVQ process in three ways: an improved update strategy, an
improved learning rule to avoid classifier divergence, and improved prototype utilization.


(a) Improved Update Strategy

GRLVQI first has a new update strategy that adds a scalar time decay term, $\tau$, to
the misclassification measure in (3.34), which becomes

$$\mu(\mathbf{x}_m) = \tau\,\frac{d^J - d^K}{d^J + d^K}, \tag{3.39}$$

which also implies, per [244-246, 292], that

$$f'(\mu(\mathbf{x}_m), \tau) = f(\mu(\mathbf{x}_m), \tau)\,\big(1 - f(\mu(\mathbf{x}_m), \tau)\big). \tag{3.40}$$

Since $\tau$ is defined as a scalar, per Section 2.3.2.1 of [244], it does not affect the
derivation process related to the gradient descent operations in GLVQ and GRLVQ, and the
underlying framework of these algorithms is left intact.

(b) Improved Learning Rule

The improved GRLVQ algorithm incorporates a new learning rule by specifying
that the out-of-class prototype vector should be updated only if a misclassification occurs
[244]. The improved GRLVQ algorithm update rule is presented in Table III-2.


Table III-2: Improved GRLVQ Update Rule of Mendenhall [244]

| Condition | Rule |
|---|---|
| Misclassification | Move in-class prototype vector towards exemplar; move out-of-class prototype vector away from exemplar |
| Correct Classification | Move in-class prototype vector towards exemplar |

(c) Improved Prototype Utilization

Mendenhall [244], and Mendenhall and Merenyi [245, 246], applied the
‘conscience’ learning of DeSieno [293] to in-class PV selection. The underlying
philosophy is to discourage (bias) frequent PV winners from winning too often and to
encourage selection of infrequently selected PVs [245]. This is accomplished by
computing the “frequency” of winning for the winning PV,

$$F^{new} = F^{old} + \beta\,(1.0 - F^{old}), \tag{3.41}$$

and adjusting the frequency of the non-winning PVs via

$$F^{new} = F^{old} + \beta\,(0.0 - F^{old}), \tag{3.42}$$

where $\beta$ is a user defined parameter to control the updating [245]. The winning PV
selection approach is also updated from (3.30) by subtracting a bias term $\beta_P$,

$$d_{Bias} = d_P - \beta_P, \tag{3.43}$$

where $d_P$ is either the in-class or out-of-class distance and $\beta_P$ is defined as

$$\beta_P = \gamma\,(1 - F_P^{old}), \tag{3.44}$$

where $\gamma$ is a scaling on the amount of bias, $P$ indicates the PV number, and $F_P^{old}$ is the
frequency [244].
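
The conscience mechanism can be sketched as below, under the assumption that the bias subtracted from each distance is gamma multiplied by one minus the PV's win frequency, and that win frequencies are nudged towards 1 for the winner and towards 0 for all other PVs. The function names and example values are hypothetical and are not taken from [244-246].

```python
def update_frequencies(freq, winner, beta):
    """Move the winner's frequency towards 1.0 and the others towards 0.0."""
    return [f + beta * ((1.0 if p == winner else 0.0) - f)
            for p, f in enumerate(freq)]


def biased_winner(distances, freq, gamma):
    """Pick the PV with the smallest bias-adjusted distance.

    Assumed bias: gamma * (1 - frequency), so rarely winning PVs receive
    a larger reduction in distance and are encouraged to win.
    """
    biased = [d - gamma * (1.0 - f) for d, f in zip(distances, freq)]
    return min(range(len(biased)), key=lambda p: biased[p])


freq = [0.9, 0.1]
winner = biased_winner([1.0, 1.2], freq, gamma=2.0)
freq = update_frequencies(freq, winner, beta=0.5)
```

Although the first PV is nominally closer, its high win frequency leaves it with a small bias, so the rarely used second PV is selected and its frequency is then raised.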

3.3.1.8 Operational Settings for LVQ and GRLVQI

Determining operational settings for LVQ algorithms is a balance between
science and art [244]. Although PV initialization is known to affect classifier
development in all LVQ variants [267], little has been published about LVQ algorithmic
settings beyond specific guidelines for specific applications. A few considerations must
be made: an appropriate learning rate needs to be specified for the gradient descent, PVs
should be initialized to unique and appropriate vectors, and an appropriate number of
PVs should be initialized.

(a) Learning Rates

Determining an appropriate learning rate $\epsilon(t)$ involves some consideration of the
LVQ algorithm, architecture, and the data. Some general learning rate guidance exists
for differing algorithms. Selecting a learning rate for the gradient descent approach
involves some work; too high a learning rate introduces oscillations and possibly
divergence, while too low a learning rate results in slow convergence [216, pp. 312-313].
As mentioned in Mendenhall [244], there are “no hard-and-fast rules” in selecting
learning rates and their selection is part of the “art of classifier design.” Per Strickert et al.
[291], a general hierarchy relating the learning rate $\epsilon(t)$ and the relevance rate $\xi(t)$ is
$0 < \xi(t) < \epsilon(t) < 1$, assuming unscaled learning rates. In general the guidance of Kohonen
[255] should be followed, where $\epsilon(t)$ is specified as a monotonically decreasing
sequence of scalar values $0 < \epsilon(t) < 1$. Ideally, the monotonically decreasing term will
either reach zero as an optimal solution is found or become stationary. This is logical because a
decreasing or stationary learning rate avoids large movements within the data space as a
solution becomes more refined.

Various general recommendations exist for LVQ learning rates; for instance,
Kohonen et al. [280] recommend learning rates of $\epsilon(t) < 0.1$ for LVQ. Although Bojer et
al. [264] suggest initializing both the LVQ and relevance learning rates at 0.1,
they also employed different settings with RLVQ, such as $\epsilon(t) = 0.005$ and $\xi(t) = 0.05$
for a large mushroom dataset. Lim et al. [294] additionally suggested a default of
$\epsilon(t) = 0.03$ for LVQ.

GLVQ and GRLVQ are more complicated algorithms and deserve further
consideration. For general sigmoidal networks, which could feasibly include GLVQ,
Duda, Hart and Stork [216, pp. 312-313] posit that a learning rate of 0.1 is often
adequate for initialization. This mirrors the general recommendations for LVQ learning
rate initializations. For one dataset, Hammer and Villmann [266] suggested a learning
rate of $\epsilon(t) = 0.1$ and a relevance rate of $\xi(t) = 0.01$; they further discussed the importance
of the relevance rate being initialized smaller than the learning rate since the relevance is
updated each iteration.

GRLVQI is a more complicated algorithm still, with three kinds of learning settings to
select: the PV learning rate, the relevance learning rate, and the conscience learning
parameters. Care must be taken since the interaction of these three is complex, and
learning rates too high in magnitude could logically introduce instability and wild
movements. In GRLVQI, there are two gradient descent learning rates, the PV learning
rate $\epsilon(t)$ and the relevance rate $\xi(t)$, and two conscience parameters ($\gamma$ and $\beta$) to
consider, as seen in Table III-3. Prior work determined operational settings for GRLVQI
empirically, with Mendenhall [244, Table 3.3] and Bischoff et al. [295]
recommending the values presented in Table III-3. Bischoff et al. [295]
empirically determined their recommended values by sampling each exemplar six times
in random order during each of the $N_{TS}$ total Training Step iterations. Additionally, the
learning parameters in Table III-3 are learning schedules, which provide learning rate
values depending on the quantity of training steps GRLVQI is employing. The schedules in
Table III-3 are implemented due to performance benefits seen and discussed in Mendenhall
[244]. The values in Table III-3 do not decay within a given range of training steps, and
thus learning rates are stationary during the specified training steps.

Table III-3: Nominal GRLVQ and GRLVQI Learning Parameter Learn Schedule.

| Number of Training Steps $N_{TS}$ (Thousands) | GRLVQ Parameter $\epsilon(t)$ | GRLVQ Parameter $\xi(t)$ | Conscience Parameter $\gamma$ | Conscience Parameter $\beta$ | Reference |
|---|---|---|---|---|---|
| $0 < N_{TS} \le 400$ | 0.005 | 0.025 | 2 | 0.35 | [244] |
| $400 < N_{TS} \le 800$ | 0.0025 | 0.0125 | 2 | 0.3 | [244] |
| $800 < N_{TS} < 1200$ | 0.001 | 0.005 | 2 | 0.225 | [244] |
| $1200 < N_{TS} < 1600$ | 0.0005 | 0.0025 | 2 | 0.125 | [244] |
| $0 < N_{TS} \le 500$ | 0.005 | 0.005 | 2.5 | 0.35 | [295] |
| $0.5 < N_{TS} < 1$ | 0.0025 | 0.0025 | 2.0 | 0.30 | [295] |
| $1 < N_{TS} < 1.5$ | 0.001 | 0.001 | 1.5 | 0.225 | [295] |
| $1.5 < N_{TS} < 2$ | 0.0005 | 0.0005 | 1.0 | 0.125 | [295] |
| $2 < N_{TS} < 2.5$ | 0.00025 | 0.00025 | 0.75 | 0.1 | [295] |
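
A stepwise learning schedule of this kind can be represented as a simple lookup. The sketch below encodes the four rows attributed to [244], assuming that the first two numeric columns are the PV learning rate and the relevance rate, in that order, followed by the conscience parameters gamma and beta, and treating each upper bound as inclusive. The names `SCHEDULE_244` and `schedule_at` are hypothetical.

```python
# (upper bound in thousands of steps, epsilon, xi, gamma, beta)
SCHEDULE_244 = [
    (400, 0.005, 0.025, 2.0, 0.35),
    (800, 0.0025, 0.0125, 2.0, 0.3),
    (1200, 0.001, 0.005, 2.0, 0.225),
    (1600, 0.0005, 0.0025, 2.0, 0.125),
]


def schedule_at(step, schedule=SCHEDULE_244):
    """Return the stationary parameter values in force at a training step."""
    if step <= 0:
        raise ValueError("step must be positive")
    for upper_k, epsilon, xi, gamma, beta in schedule:
        if step <= upper_k * 1000:
            return {"epsilon": epsilon, "xi": xi, "gamma": gamma, "beta": beta}
    raise ValueError("step exceeds the final schedule boundary")


early = schedule_at(1)
late = schedule_at(1_600_000)
```

Within each range the values are constant, so the learning rates change only at the range boundaries.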

(b)  Number  of  Prototype  Vectors 

Additionally, little is written on the appropriate number of PVs to instantiate.
Kangas et al. [296] indicated that no unique solution exists for this task, but provided
guidance (albeit without examples or proofs) that setting the number of PVs in proportion
to the number of samples in each class could be a wrong strategy. Georgiou [262] posited
that more resolution is offered by increasing the number of PVs. Mendenhall [244] notes
that generalization bound methods such as Gaussian complexity [297] can be used to
determine the upper bound on the number of PVs to instantiate. One could expect that too
many PVs would lead to overfitting, as mentioned in [298], and that too few would lead to
poor classification performance. Therefore, selecting the appropriate number of PVs is of
interest, despite little being written on it.

A general restriction in LVQ algorithms is that the data must contain at least
two classes and that there must be at least one PV per class [299]. However, further
guidance on the number of neurons/prototype vectors to initialize is rarely mentioned in
publications. Kohonen merely mentions that the optimal number of PVs is generally not
proportional to the prior probability of classes [282]. Additionally, it was suggested that
PVs could be deleted during the learning process [282]. However, no general framework was
presented to suggest the appropriate number of PVs to initialize.

(c)  Prototype  Vector  Initialization 

A final consideration in LVQ network initialization is the proper initialization of
the PVs themselves. Basic PV initialization approaches include using the data
sampling distribution [244], extreme points in the data [300], borders between classes
[296], or random values [266, 301]. Additionally, some literature suggests initializing
PVs using k-means to find cluster centers [267, 282], self-organizing maps [282], or by
finding the means of each class [282]. However, employing k-means or self-organizing
maps is akin to a fusion process of an unsupervised classifier feeding into a supervised
classifier, and k-means is iterative and not computationally free. PV initialization was a
concern of Mendenhall [244], resulting in the addition of conscience parameters in the
GRLVQI algorithm.

Logically, the key aspect of any PV initialization process is that the PVs and data
exist in the same space; PVs should be initialized near the data dynamic
range or else valuable iterations will be spent moving towards the data. Two
logical choices exist for proper PV initialization: 1) initializing PVs to normal random
values and standardizing the data to have a dynamic range comparable to the random
values, and 2) initializing PVs to random values in the data space. Herein, and consistent
with [51], PVs are initialized with random normal values with the data standardized via
standard score normalization [302],

$$z = \frac{x - \mu}{\sigma}, \tag{3.45}$$

where $\mu$ is the mean of a given data vector and $\sigma$ its standard deviation.
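
The initialization approach described above, standardizing each feature and drawing PVs from a standard normal distribution, can be sketched as follows. The use of the population standard deviation, the handling of constant features, the seed, and the function names are assumptions made for this example.

```python
import random
import statistics


def standardize(rows):
    """Standard-score each feature (column) of a list of exemplar rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation is an assumption; constant features
    # use a divisor of 1.0 to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)]
            for row in rows]


def init_pvs(n_pv, n_features, seed=0):
    """Draw each PV component from a standard normal distribution."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for _ in range(n_pv)]


data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
z = standardize(data)
pvs = init_pvs(n_pv=6, n_features=2)
```

After standardization every feature has zero mean and unit spread, so the randomly drawn PVs start in roughly the same dynamic range as the data.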


(d)  Number  of  Training  Iterations 

Similar  to  the  issues  of  PV  initialization,  learning  rate  initialization,  and  selecting 
the  number  of  PVs,  very  little  appears  in  literature  on  selecting  Nts-  For  LVQ,  Kohonen 
[255]  recommends  Nts  =  500  x  Npy  as  a  general  rule.  Literature  recommends  various 
numbers  of  iterations,  including  150  <  Nts  ^  600  [303],  500  <  Nts  ^  2,500  [295], 
Nts=  1200  [51],  a  maximum  of  Nts  =  10,000  [255],  and  400K<  AVs<  1.6M  [244], 

Reising  [51]  adopted  an  approach  where  multiple  iterations  were  employed  and 
then  the  best  models  were  selected.  Such  an  approach  is  consistent  with  the  method 
employed  by  Gage  [304]  for  ANN  training.  Gage  [304]  adopted  Welch’s  method  [305] 
for  convergence  to  determine  when  to  stop  training  and  how  many  training  epochs  to  use. 
Rather than find steady-state operating conditions, one looks for a stable operating point where volatility has decreased [304, 305]. Hence, this is a visual approach to determine where data “converges” [304, 305]. Similar to the approach of Gage [304], Reising [51] computed the GRLVQI model at each iteration and then determined which model offered the best performance. The best performing model was then used for subsequent analysis and for comparison against the sequestered test set.

3.4  Device  Classification  and  Verification  Methodology 

For  model  development,  classification  accuracy  is  the  standard  performance 
metric  used  for  the  RF-DNA  problems;  however,  it  is  analyzed  in  different  ways 
depending  on  the  task  at  hand  (classification/model  building  or  verification). 
Historically, c.f. [18, 89, 92, 113, 224], the Air Force Institute of Technology’s (AFIT’s) RF-DNA development has considered Device Classification as a one-to-many “looks most like?” assessment, and Device ID Verification as a one-to-one “looks how much like?” assessment. In operation, classification is used for model development with the library at hand, while verification is examined when new devices attempt to claim the identity of a known device. These concepts extend from the biometrics concepts of enrollment, collecting templates from users; verification, validating a user’s identity through comparison with that user’s template; and identification, recognizing a user by searching the entire database [6].

3.4.1  Classification  Performance 

RF-DNA classification performance generally considers evaluation of training, testing, and validation (in GRLVQI) performance of classifier models. Both the MDA/ML and the GRLVQI processes were applied using a full-dimensional (NF = 729) RF-DNA feature set extracted from ZigBee emissions collected to support results in [91]. Classification results are presented in Figure III-6, which shows that MDA/ML overall outperforms GRLVQI while both show a general pattern of high classification accuracy at high SNR and relatively lower classification accuracy at lower SNR. Classification results for Z-Wave devices are similarly presented in Figure III-7. Comparing performance to these baseline results is one general approach used to evaluate algorithmic performance throughout this research.


(a)  MDA/ML  (b)  GRLVQI 

Figure  III-6:  ZigBee  Full-dimensional  Baseline  Classification  Results 


(a)  MDA/ML  (b)  GRLVQI 

Figure III-7: Z-Wave Full-dimensional Baseline Classification Results

3.4.1.1 Classification Performance Assessment: Gain Trade-Offs

One basic approach employed to compare classification performance between competing algorithms, or performance of a given algorithm for various settings, is relative performance gain GSNR. Consistent with prior RF-DNA works [51], performance gain GSNR is defined herein as the reduction in required SNR, expressed in dB, between the two methods under consideration to achieve a given average percentage of correct classification (%C). This definition is depicted in Figure III-8 for MDA/ML and GRLVQI testing performance at %C = 90%, a nominal, arbitrary operating point. As indicated in Figure III-6, MDA/ML requires SNR = 8.68 dB (TNG) and SNR = 8.99 dB (TST), while GRLVQI requires SNR = 12.92 dB (TNG) and SNR = 12.39 dB (TST) to achieve the same performance. Thus, for ZigBee, MDA/ML is superior and provides a gain of GSNR = 3.4 dB (TST) relative to GRLVQI.

Figure III-8: Gain Trade Off Example for MDA/ML (TST) and GRLVQI (TST) for ZigBee.

If one similarly considers TST results in Figure III-7 for Z-Wave, GRLVQI is seen to be superior and yields a gain relative to MDA/ML of GSNR = +3.32 dB (TST). Therefore, when considering classification performance, GRLVQI is the superior classifier for Z-Wave, while MDA/ML is the superior classifier for ZigBee.
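The gain computation can be sketched as follows. This is a hedged illustration: the helper names and the linear interpolation between measured SNR points are assumptions, and the five-point curve is hypothetical. Only the SNR values at %C = 90% (8.99, 12.39, 22.91 and 19.59 dB) come from the reported TST results.

```python
def snr_at_target(snr_db, pct_correct, target=90.0):
    """Lowest SNR (dB) at which a %C-versus-SNR curve first reaches `target`.

    Assumes snr_db is sorted ascending and interpolates linearly between
    neighbouring points; returns None if the target is never reached.
    """
    pairs = list(zip(snr_db, pct_correct))
    if pairs[0][1] >= target:
        return pairs[0][0]
    for (s0, c0), (s1, c1) in zip(pairs, pairs[1:]):
        if c0 < target <= c1:
            return s0 + (target - c0) * (s1 - s0) / (c1 - c0)
    return None


def gain_db(snr_baseline, snr_method):
    """Relative gain: positive when the method needs less SNR than the baseline."""
    return snr_baseline - snr_method


# Reported TST operating points at %C = 90%, with MDA/ML as the baseline.
zigbee_gain = gain_db(8.99, 12.39)   # GRLVQI relative to MDA/ML, ZigBee
zwave_gain = gain_db(22.91, 19.59)   # GRLVQI relative to MDA/ML, Z-Wave

# Hypothetical curve, used only to exercise the interpolation helper.
crossing = snr_at_target([0, 4, 8, 12, 16], [40, 60, 80, 95, 99])
```

With MDA/ML as the baseline, the ZigBee gain for GRLVQI is negative (about -3.4 dB) and the Z-Wave gain is positive (about +3.32 dB), matching the signs discussed above.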

3.4.1.2 Classification Performance Assessment: Relative Accuracy Percentage (RAP)

To facilitate broader comparison of %C versus SNR performance, a Relative Accuracy Percentage (RAP) metric was introduced in Bihl et al. [135]. The RAP is generated by first computing the Area Under Classification Curve (AUCC) values for each method being compared. This is done using a trapezoidal approximation, with a given method’s estimated AUCCM(i) being in the numerator and the baseline AUCCBase being in the denominator

$RAP = AUCC_{M(i)} / AUCC_{Base}$ .  (3.46)

According to (3.46), RAP provides the fraction of AUCCM(i) with respect to AUCCBase and enables both 1) a comparison for methods that do not achieve the arbitrary %C > 90% performance benchmark, and 2) a comparison reflecting performance across all SNR levels. Interpreting RAP values is also intuitive, with 1) RAP < 1.0 indicating that the method under consideration achieves lower %C than the baseline method, 2) RAP = 1.0 indicating that the method under consideration achieves %C performance comparable to the baseline, and 3) RAP > 1.0 indicating that the method under consideration exceeds baseline %C performance.

For the ZigBee results in Figure III-6 with MDA/ML serving as the baseline, AUCCBase = 27.18 (TST), AUCCGRLVQI = 25.24 (TST) and RAP = 0.93, indicating that MDA/ML performs better than GRLVQI when aggregated across all operating points. For Z-Wave results in Figure III-7, MDA/ML AUCCBase = 13.32 (TST) and AUCCGRLVQI = 15.06 (TST), yielding RAP = 1.13, which indicates that GRLVQI performs better than MDA/ML for Z-Wave when aggregated across all operating points.
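A minimal sketch of the RAP computation in (3.46) follows. The function names are illustrative, and the three-point curve is hypothetical; how %C is scaled before integration (fraction versus percent) is an assumption, since the AUCC scaling is not restated here. The AUCC values passed to `rap` are the reported TST results.

```python
def aucc(snr_db, pct_correct):
    """Trapezoidal approximation of the area under a %C-versus-SNR curve."""
    area = 0.0
    for i in range(1, len(snr_db)):
        width = snr_db[i] - snr_db[i - 1]
        area += 0.5 * (pct_correct[i] + pct_correct[i - 1]) * width
    return area


def rap(aucc_method, aucc_base):
    """Relative Accuracy Percentage per (3.46): method AUCC over baseline AUCC."""
    return aucc_method / aucc_base


# Reported TST AUCC values, with MDA/ML as the baseline.
rap_zigbee = rap(25.24, 27.18)   # GRLVQI versus MDA/ML, ZigBee
rap_zwave = rap(15.06, 13.32)    # GRLVQI versus MDA/ML, Z-Wave

# Hypothetical curve (accuracy as a fraction) to exercise the trapezoid rule.
example_area = aucc([0, 2, 4], [0.5, 0.7, 0.9])
```

Rounded to two decimals, these ratios reproduce the reported RAP = 0.93 for ZigBee and RAP = 1.13 for Z-Wave.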

3.4.2  Device  ID  Verification 

In essence, device ID verification is a form of conditional classification which considers a one-to-one comparison of a device’s actual identity with its claimed identity [19]. This approach approximates a trained and tested classifier when examining possibly new, previously unseen data. A device is considered to be authentic when the posterior probability

$P(\omega \mid F_{New}) > t$ ,  (3.47)

with FNew being a newly observed RF-DNA fingerprint; ω, the class whose identity the device claims; and t being a decision threshold [19]. Device ID verification performance is then evaluated by plotting Receiver Operating Characteristic (ROC) curves over various decision thresholds [19].

3.4.2.1 Verification Performance Assessment: ROC Curves

Consistent with [89], two error conditions are evaluated: False Verify Rate (FVR), for rogue devices, and False Reject Rate (FRR), for authorized devices. FVR and FRR are respectively evaluated against either True Verify Rate (TVR) or True Rejection Rate (TRR) to generate ROC performance curves [89], consistent with the general ROC methodology of [306]. The equal error rate (EER) point on these ROC-like curves is either 1-TVR for authorized or 1-TRR for rogue. Consistent with prior research, e.g. [89], verification performance will be evaluated as %Authorized or %Rogue Rejected at 90% TVR/TRR and 10% FVR/FRR.
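The threshold test in (3.47) and the resulting ROC points can be sketched as below. This is an illustration under stated assumptions: the function names are hypothetical, posterior probabilities are pooled over all claims rather than tracked per device, and the score lists are invented for the example.

```python
def verification_rates(authorized_scores, rogue_scores, threshold):
    """Apply P(claimed class | fingerprint) > t to authorized and rogue claims.

    authorized_scores: posteriors for devices claiming their true identity.
    rogue_scores: posteriors for rogue devices claiming an authorized identity.
    """
    tvr = sum(s > threshold for s in authorized_scores) / len(authorized_scores)
    fvr = sum(s > threshold for s in rogue_scores) / len(rogue_scores)
    return {"TVR": tvr, "FRR": 1.0 - tvr, "FVR": fvr, "TRR": 1.0 - fvr}


def roc_points(authorized_scores, rogue_scores, thresholds):
    """Return (FVR, TVR) pairs, one per decision threshold."""
    points = []
    for t in thresholds:
        r = verification_rates(authorized_scores, rogue_scores, t)
        points.append((r["FVR"], r["TVR"]))
    return points


authorized = [0.95, 0.90, 0.80, 0.40]   # hypothetical posteriors
rogue = [0.70, 0.30, 0.20, 0.10]        # hypothetical posteriors
rates = verification_rates(authorized, rogue, threshold=0.5)
curve = roc_points(authorized, rogue, [0.0, 0.5, 1.0])
```

Sweeping the threshold from 0 to 1 moves the operating point from accepting every claim to rejecting every claim, which traces out the ROC-like curve.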

3.4.2.2 Baseline Verification Performance

When examining verification performance at 18 dB for authorized (Figure III-9) and rogue (Figure III-10) devices, MDA/ML appears to achieve perfect verification (Figure III-9a and Figure III-10a), while GRLVQI presents considerably lower verification performance. Therefore, improving GRLVQI to make it a viable RF-DNA algorithm is of major importance, both to ensure multiple competing classifier methods are vetted for future RF-DNA research and to understand what could be leading to this deficiency.

(a) MDA/ML (18 dB) (b) GRLVQI (18 dB)

Figure III-9: ZigBee MDA/ML and GRLVQI full-dimensionality authorized device verification results baseline

Figure III-10: ZigBee MDA/ML and GRLVQI full-dimensionality rogue device verification results baseline


Figure III-11 presents verification results for Z-Wave devices at 20 dB using the MDA/ML classifier, Figure III-11a, and the GRLVQI classifier, Figure III-11b. Since only NDev = 3 devices are in the Z-Wave dataset, only authorized device results are presented. Although Z-Wave fingerprints were associated with higher GRLVQI classification performance, here one can see that verification performance is better with MDA/ML.

Figure III-11: Z-Wave MDA/ML and GRLVQI full-dimensionality authorized device verification results baseline


3.4.3  MDA/ML  and  GRLVQI  Baseline  Results 

Overall classification results for MDA/ML and GRLVQI using both ZigBee and Z-Wave RF-DNA feature sets are presented in Table III-4. The relative RAP and Gain metrics in Table III-4, with MDA/ML serving as the baseline method (highlighted in grey), illustrate that MDA/ML generally outperforms GRLVQI for ZigBee RF-DNA classification, while Z-Wave achieves generally better classification performance using the GRLVQI classifier.




Table III-4: Baseline Classification Results.

Device   Algorithm  Data Set  AUCC   SNR at %C = 90%  Relative MDA/ML (TST) RAP  Relative MDA/ML (TST) Gain GSNR (dB)
-------  ---------  --------  -----  ---------------  -------------------------  ------------------------------------
ZigBee   MDA/ML     Training  27.39  8.68 dB          1.01                       +0.31
ZigBee   MDA/ML     Testing   27.18  8.99 dB          1.00                       0.00
ZigBee   GRLVQI     Training  24.99  12.92 dB         0.92                       -3.93
ZigBee   GRLVQI     Testing   25.24  12.39 dB         0.93                       -3.4
Z-Wave   MDA/ML     Training  16.39  21.23 dB         1.23                       +1.68
Z-Wave   MDA/ML     Testing   13.32  22.91 dB         1.00                       0.00
Z-Wave   GRLVQI     Training  15.23  19.19 dB         1.14                       +3.72
Z-Wave   GRLVQI     Testing   15.06  19.59 dB         1.13                       +3.32

For the ZigBee results in Figure III-6 with MDA/ML serving as the baseline, AUCCBase = 27.18 (TST), AUCCGRLVQI = 25.24 (TST) and RAP = 0.93 indicating that MDA/ML performs better across all operating points than GRLVQI. For Z-Wave results in Figure III-7, MDA/ML AUCCBase = 13.32 (TST) and AUCCGRLVQI = 15.06 (TST), yielding RAP = 1.13 which indicates that GRLVQI performs better across all operating points when compared to MDA/ML for Z-Wave.

Authorized and rogue device verification results for ZigBee are presented in Table III-5 for selected SNR operating points. Table III-5 illustrates that MDA/ML generally achieves higher verification accuracy at lower SNR than GRLVQI. Consistent with the ZigBee results, authorized verification results for Z-Wave are presented in Table III-6 for selected SNR operating points, which again illustrates that MDA/ML generally achieves higher verification accuracy at lower SNR than GRLVQI.

Table III-5: ZigBee Baseline Device ID Verification Results.

Algorithm  Operating SNR (dB)  Verification Scenario  Verification Accuracy (%)
---------  ------------------  ---------------------  -------------------------
MDA/ML     10                  TVR (%)                100
MDA/ML     10                  RRR (%)                100
MDA/ML     14                  TVR (%)                100
MDA/ML     14                  RRR (%)                100
MDA/ML     18                  TVR (%)                100
MDA/ML     18                  RRR (%)                100
MDA/ML     20                  TVR (%)                100
MDA/ML     20                  RRR (%)                100
MDA/ML     22                  TVR (%)                100
MDA/ML     22                  RRR (%)                100
GRLVQI     10                  TVR (%)                0
GRLVQI     10                  RRR (%)                8.33
GRLVQI     14                  TVR (%)                25
GRLVQI     14                  RRR (%)                47.22
GRLVQI     18                  TVR (%)                25
GRLVQI     18                  RRR (%)                63.88
GRLVQI     22                  TVR (%)                50
GRLVQI     22                  RRR (%)                75

Table III-6: Z-Wave Baseline Device ID Verification Results.

Algorithm  Operating SNR (dB)  Verification Scenario  Verification Accuracy (%)
---------  ------------------  ---------------------  -------------------------
MDA/ML     10                  TVR (%)                0
MDA/ML     14                  TVR (%)                66
MDA/ML     18                  TVR (%)                100
MDA/ML     22                  TVR (%)                100
GRLVQI     10                  TVR (%)                0
GRLVQI     14                  TVR (%)                0
GRLVQI     18                  TVR (%)                0
GRLVQI     22                  TVR (%)                66

3.4.4 MDA versus GRLVQI in RF-DNA Research

While MDA/ML consistently outperforms GRLVQI in both classification and verification tasks for ZigBee, Z-Wave devices saw better classification performance using GRLVQI and better verification performance using MDA/ML. It is therefore advantageous to consider further research in GRLVQI developments with emphasis towards RF-DNA applications because MDA/ML has known deficiencies in certain contexts.

Firstly, based on the criteria in (3.10), MDA is limited when the number of classes exceeds the number of available features, a possible situation if many devices were considered in a real world setting where ZigBee networks can contain up to 65,000 devices [39]. However, it should be noted that 1) all pattern recognition methods have performance issues (in both accuracy and computation) as the number of classes grows into the 10s to 100s (let alone 1000s), as seen in the literature on “highly multiclass” problems, c.f. [307-314], and 2) linear methods, such as MDA, are commonly employed in “highly multiclass” problems due to their computational advantages, c.f. [310, 315, 316].

Secondly, as seen in Reising [51] and the Z-Wave dataset results in Figure III-7 and Figure III-11, GRLVQI does outperform MDA/ML in some RF Fingerprinting applications. Thirdly, data distributions, and particularly bimodality, can cause issues in MDA with respect to finding the best discriminant direction, as seen in [317]; such distributions are logically possible given the many varied applications of RF-DNA. Therefore, ample motivation exists for improving and furthering the understanding of GRLVQI and applying such improvements for further RF-DNA research.

IV.  Dimensional  Reduction  Analysis


Men  dig  up  and  search  through  much  earth  to  find  gold. 

-Heraclitus,  535BC  -  475BC 

Given large volumes of data being collected in many domains, e.g. big data [318-327], the primary challenge becomes selecting relevant data features for a given task. Dimensional Reduction Analysis (DRA) is therefore of interest to select salient subsets of a dataset for analysis.

4.1  Introduction 

As Ruskin states in [328], “For all books are divisible into two classes: the books of the hour, and the books of all time,” thus indicating that relevance and importance are critical. Hayek similarly notes in [329] that many problems can be reduced to logic “...if we possess all the relevant information, if we can start out from a given system of preferences and if we command complete knowledge of available means.” Many datasets contain more data than necessary for reliable classification, a problem that can be addressed using DRA to improve performance after discarding non-salient features [330]. One concept in feature selection is that feature salience is linked to dependence on class labels [331]; therefore, feature selection methods that result from classifier model development (termed post-classification) and methods that consider the distribution of data with respect to a class label vector (e.g. Analysis of Variance (ANOVA) based F-test) are of particular interest.

This chapter is organized as follows. Section 4.2 presents, develops and discusses various DRA methods, with Section 4.2.1 discussing pre-classification DRA methods, Section 4.2.2 discussing post-classification DRA methods, Section 4.2.3 developing MDA based DRA methods, Section 4.2.4 discussing DRA fusion, Section 4.2.5 discussing Random DRA as a baseline method, and Section 4.2.6 discussing dimensionality assessment methods. Section 4.3 then considers Multiple Discriminant Analysis (MDA) models and ZigBee RF-DNA features to assess various DRA methods for device discrimination, including both Device Classification (1 vs. Nc assessment) and Device ID Verification (1 vs. 1 assessment).

4.2  Dimensional  Reduction  Analysis  Methods 

DRA can consist of many processes and actions; at the highest level, DRA is considered to embody three aspects: dimensionality assessment (qualitative versus quantitative), feature selection versus feature extraction, and pre-classification versus post-classification. The following describe higher level aspects of DRA:

1.  Pre-classification versus post-classification: The distinction between pre-classification and post-classification DRA involves where in the overall pattern recognition process the DRA is performed. Pre-classification DRA involves any method performed a priori of any classification step, e.g. input data distribution-based methods, while post-classification DRA is performed a posteriori of the classification step and includes information from the classifier on feature relevance, e.g. MDA loadings [237, pp. 394-429] [242], Artificial Neural Network - Signal to Noise Ratio (ANN-SNR) feature screening [330] and Relevance Learning Vector Quantization (RLVQ) methods [51]. Pre-classification DRA methods are also known as filter methods, and post-classification DRA methods are also known as embedded or wrapper methods [332, 333]. Since pre-classification DRA is not directly tied to classifier performance, it does not necessarily improve classifier performance, as seen in [334].

2.  Feature selection versus feature extraction: Consistent with [213, 335], feature selection involves selecting subsets of existing features through pre-classification or post-classification feature relevance rankings, while feature extraction involves a data transformation into either a lower dimensional space or a transformed space, e.g. the RF Distinct Native Attribute (RF-DNA) Fingerprinting Process itself, MDA, or Principal Component Analysis (PCA). Feature selection is relevant throughout many domains, from multivariate statistics to manufacturing [336].

3.  Dimensionality Assessment: DRA also involves an operator judgment on the amount of data to retain. Both qualitative and quantitative dimensionality assessment methods can be used. Quantitative dimensionality assessment computationally determines the amount of data or what features to retain, whereas qualitative dimensionality assessment involves subjective selection of the quantity of features. In some applications, subject matter expertise can be leveraged for qualitative dimensionality assessment, as in [89, 91, 113] where subjective amounts of features were retained. Quantitative dimensionality assessment methods are considered here using heuristics on data covariance matrix eigenvalues, MDA loadings, and test statistic p-values.

Excluding the RF-DNA Fingerprinting feature extraction process itself, described in Section II, prior DRA research for RF-DNA has considered three feature selection methods: A) a pre-classification distribution-based two-sample Kolmogorov-Smirnov goodness-of-fit test (KS-test), B) a post-classification GRLVQI feature relevance rankings process [91], and C) the post-classification Random Forest feature relevance rankings process [134]. While all three approaches have seen success in RF-DNA applications, logically DRA methods associated with classification, e.g. post-classification, should be associated with improved classification performance.

Of particular interest to this research were methods that could be used to

1.  improve and expand the RF-DNA DRA foundation by improving the understanding of the KS-test DRA algorithm, which involves understanding the appropriate use of p-values and test statistics for feature relevance ranking,

2.  extend the distribution-based one-way ANOVA F-statistic method to RF-DNA,

3.  compare and contrast dimensionality assessment approaches,

4.  aid development of an MDA-based DRA algorithm,

5.  compare with GRLVQI feature relevance ranking, and

6.  aid development of DRA fusion approaches to combine multiple feature relevance ranking approaches.

4.2.1  Distribution  Based  Feature  Selection  DRA 

Distribution-based pre-classification feature selection for RF-DNA considers either data feature distributions with respect to class membership or data feature distributions against other features. Both approaches are considered herein using the two-sample KS-test and the F-statistic. Additionally, of particular interest is understanding whether test statistic values or probabilities (p-values) from the tests are best for achieving reliable feature relevance ranking.

4.2.1.1 Two Sample Kolmogorov-Smirnov (KS) Test

The KS-test was codified by Massey [337] based on independent contributions by Kolmogorov [338] and Smirnov [339]. The KS-test is a distribution-based goodness-of-fit process for comparing the distribution of a given sample vector (x*) with a given reference distribution [337]. The two sample KS-test is an extension that quantifies differences in cumulative distribution functions for two sample vectors (x1 and x2) using a test statistic of the form,

$KS = \max\left( |F_1(x) - F_2(x)| \right)$  (4.1)

where F1(x) is the proportion of x1 values less than or equal to x, F2(x) is the proportion of x2 values less than or equal to x, and KS is the computed test statistic value [337, 340, 341]. With the test statistic, KS, being the maximum difference between the curves, if x1 and x2 come from the same distribution, the value of KS converges to zero. Higher values of KS indicate different distributions while lower KS values indicate similar distributions [337; 340, pp. 344-385].

For determining p-values, the underlying KS-test null hypothesis is that x1 and x2 are from the same distribution and the alternative hypothesis is that they are from different distributions [337, 340]. For the KS-test, data degrees of freedom (DoF) and the null distribution are used to compute p-values, with p-values ranging from 0 to 1 [340]. Additionally, KS-test p-values can identically equal 0 [340]. Although not mentioned in [91, 134, 241] and largely automated in practice, the process for computing approximated KS-test p-values is rather involved and requires first computing


$\zeta = \max\left( \left( \sqrt{N_e} + 0.12 + \frac{0.11}{\sqrt{N_e}} \right) KS,\ 0 \right)$ ,  (4.2)

where KS is the KS-test statistic value from (4.1) and

$N_e = \frac{N_1 N_2}{N_1 + N_2}$  (4.3)

which is one half of the harmonic mean [283] of the number of observations in Group 1 (N1) and Group 2 (N2) [342, pp. 623-628]. To compute the KS-test p-value, the following function is used

$p_{est} = 2 \sum_{i=1}^{\infty} (-1)^{i-1} e^{-2 i^2 \zeta^2}$ .  (4.4)

with the final approximation of the p-value then computed as

$p = \min(\max(p_{est}, 0), 1)$ ,  (4.5)

where the min and max functions ensure the estimate is bounded between 0 and 1 [337; 342, pp. 623-628; 343-345].
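The chain from (4.1) through (4.5) can be sketched in a few lines of Python. This is a hedged illustration: the function names are hypothetical, the infinite series is truncated at 101 terms (an assumed cut-off, chosen odd so that a zero statistic yields p = 1 after clamping), and the sample vectors are invented.

```python
import math
from bisect import bisect_right


def ks_statistic(x1, x2):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    s1, s2 = sorted(x1), sorted(x2)
    gap = 0.0
    for x in s1 + s2:
        f1 = bisect_right(s1, x) / len(s1)   # proportion of x1 values <= x
        f2 = bisect_right(s2, x) / len(s2)   # proportion of x2 values <= x
        gap = max(gap, abs(f1 - f2))
    return gap


def ks_pvalue(ks, n1, n2, terms=101):
    """Approximate p-value from the KS statistic and the two sample sizes."""
    ne = n1 * n2 / (n1 + n2)                                  # effective sample size
    zeta = max((math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * ks, 0.0)
    series = sum((-1) ** (i - 1) * math.exp(-2.0 * i * i * zeta * zeta)
                 for i in range(1, terms + 1))                # truncated series
    return min(max(2.0 * series, 0.0), 1.0)                   # clamp to [0, 1]


x1 = [1.0, 2.0, 3.0, 4.0]   # hypothetical class-1 feature values
x2 = [5.0, 6.0, 7.0, 8.0]   # hypothetical class-2 feature values
ks = ks_statistic(x1, x2)
p = ks_pvalue(ks, len(x1), len(x2))
```

Fully separated samples give KS = 1 and a small p-value, while identical samples give KS = 0 and p = 1.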



Feature selection using the two sample KS-test was first proposed by Nechval [346] in 1988 for image processing and, prior to Dubendorfer [91], the KS-test saw limited DRA application with only one additional citation [347]. For RF-DNA DRA, KS-test p-values have seen many applications [89, 113, 134, 348]. For DRA, the KS-test is implemented pairwise in each feature by classes, where one should logically seek x1 and x2 from different distributions to avoid redundancy [113, 121]. For multiple classes, pairwise KS-test p-values are computed for each feature and then summed [91].

The formulation of the KS-test DRA algorithm in Figure IV-1 is based on Patel’s [134] work and was revised here to include both A) the logical inequality of i ≠ j to ensure it is clear that only non-identical vectors are compared, and B) the correct inclusion of the test statistic from which the p-value is computed. The algorithm iteratively considers each feature via a pairwise comparison of the feature per class.
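The pairwise procedure of Figure IV-1 can be sketched in Python as follows. The sketch makes several assumptions that are not from [134]: the data layout (a dictionary mapping class label to rows of feature values), the helper names, the 101-term truncation of the p-value series, and the toy data. It returns both the summed p-value and the mean test statistic for each feature.

```python
import math
from bisect import bisect_right


def ks_statistic(x1, x2):
    """Largest gap between the empirical CDFs of two samples."""
    s1, s2 = sorted(x1), sorted(x2)
    return max(abs(bisect_right(s1, x) / len(s1) - bisect_right(s2, x) / len(s2))
               for x in s1 + s2)


def ks_pvalue(ks, n1, n2, terms=101):
    """Series approximation of the KS p-value, clamped to [0, 1]."""
    ne = n1 * n2 / (n1 + n2)
    zeta = max((math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * ks, 0.0)
    series = sum((-1) ** (i - 1) * math.exp(-2.0 * i * i * zeta * zeta)
                 for i in range(1, terms + 1))
    return min(max(2.0 * series, 0.0), 1.0)


def ks_feature_relevance(data_by_class):
    """Sum pairwise p-values and average KS statistics for every feature."""
    labels = list(data_by_class)
    n_features = len(data_by_class[labels[0]][0])
    n_pairs = len(labels) * (len(labels) - 1)        # ordered pairs with i != j
    p_sum = [0.0] * n_features
    ks_mean = [0.0] * n_features
    for v in range(n_features):
        for i in labels:
            for j in labels:
                if i != j:                           # skip identical vectors
                    xi = [row[v] for row in data_by_class[i]]
                    xj = [row[v] for row in data_by_class[j]]
                    ks = ks_statistic(xi, xj)
                    p_sum[v] += ks_pvalue(ks, len(xi), len(xj))
                    ks_mean[v] += ks / n_pairs
    return p_sum, ks_mean


# Hypothetical data: feature 0 separates the classes, feature 1 does not.
data = {"A": [[1, 5], [2, 6], [3, 7], [4, 8]],
        "B": [[11, 5], [12, 6], [13, 7], [14, 8]]}
p_sum, ks_mean = ks_feature_relevance(data)
```

In this toy case the discriminating feature receives the smaller summed p-value and the larger mean statistic, which is the ordering a relevance ranking would exploit.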


Algorithm 1 KS-Test for Feature Selection
for each feature v = 1 → NF do
    for i = 1 → Nc classes do
        for j = 1 → Nc classes do
            if i ≠ j do
                xi = observations from class i, variable v
                xj = observations from class j, variable v
                KS = max(|Fi(xi) - Fj(xj)|)
                p(v) = p(v) + p(KS, DoF)
            end if
        end for
    end for
end for

Figure IV-1: p-value KS-test Feature Selection Algorithm as adapted from Patel [134] and modified herein.

Figure IV-2 presents resultant summed p-values for ZigBee features using the algorithm in Figure IV-1. Results in this figure are consistent with observations made in [113, 121], i.e. phase (φ) features (indices 244 to 486) are collectively the most relevant (smaller p-values) when compared to amplitude (a) features (indices 1 to 243) and frequency (f) features (indices 487 to 729). However, it is evident in Figure IV-2 that a majority of features have very low (less than 0.1) summed p-values, which may result in low feature selection resolution due to minute differences between relevance ranking values.


Feature  Index  Number 


Figure IV-2: Sum of p-values from pairwise KS-test on ZigBee training observations using a full-dimensional (NF = 729) feature set at SNR = 10 dB [89, 113]. Lower values indicate greater feature difference and potentially greater relevance.

Figure IV-3 presents the corresponding mean test statistic values for the p-values seen in Figure IV-2. Again, as in Figure IV-2, Figure IV-3 shows that phase (φ) features are most relevant (higher test statistic values). Incidentally, the p-values in Figure IV-2 trend toward zero while the test statistic values in Figure IV-3 do not trend to any single value.

Figure IV-3: Mean of test statistic values from pairwise KS-test on ZigBee training observations using a full-dimensional (NF = 729) feature set at SNR = 10 dB. Higher values indicate more different (and possibly more relevant) features.

4.2.1.2 One Way Analysis of Variance F-Statistic

Although previously unexplored for RF-DNA feature selection, feature ranking by F-statistic values from one-way ANOVA was first reported by Habbema and Hermans [349] and has seen further application in medical [350] and education [351] data analysis, and other DRA applications [333, 352]. The underlying premise of F-statistic based DRA involves selecting features that provide a good relationship to the class membership, with the process echoing Hall and Smith’s [353] advice that “a good predictor set should contain features highly correlated with the target class distinction, and yet uncorrelated with each other.”

ANOVA considers a linear model which expresses the relationships between parameters as

$Y = XB + \varepsilon$ ,  (4.6)

where Y is a continuous response variable (each feature herein), X is an input variable (categorical vector of class identities herein), B are the solved parameters, and ε is a vector of errors assumed to be iid [302, 354, 355]. ANOVA employs the linear model in (4.6) to understand variability in observations through sum of squares computations of the observations about their mean and sums of squares associated with observational groups [302].

The F-test is a heuristic used to compute the significance of an ANOVA relationship, and is defined as

$F_0 = \frac{MS_{Model}}{MSE}$ ,  (4.7)

where MSModel is the mean square for a given general linear model between X and Y, and MSE is the mean squared error in a computed linear ANOVA model [302]. Traditionally, for ANOVA problems, p-values are computed from the F-test and used to determine if a relationship is significant or not for the null hypothesis that there is no relationship between X and Y [302]. When considered as a feature selection problem, higher values of F0 are taken to indicate that a feature is more likely to be useful in discriminating between classes [350]. To compute the p-value, the F-distribution is used, which has a probability density function,


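$$ f(x \mid u, v) = \frac{\Gamma\left(\frac{u+v}{2}\right)\left(\frac{u}{v}\right)^{u/2} x^{u/2-1}}{\Gamma\left(\frac{u}{2}\right)\Gamma\left(\frac{v}{2}\right)\left[\left(\frac{u}{v}\right)x+1\right]^{(u+v)/2}}, \quad 0 < x < \infty, \tag{4.8} $$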


with u and v being the respective Degrees of Freedom (DOF) for the numerator and
denominator terms in (4.7) [302]. For RF-DNA application, u is the DOF due to groups
(N_C - 1) and v is the DOF due to the number of observations (N_Tng - u - 1). Figure IV-4
presents the F-distribution computed for the entirety of the ZigBee training data, with u =
3 and v = 4796. The x-axis is in units of F-statistic value, as computed by (4.7), and the
y-axis is the F-distribution density value, as computed by (4.8) [302]. P-values are then
computed by finding the area under the curve (AUC) at a given F-statistic value; these
p-values are either one-sided (upper or lower tail) or two-sided (both the upper and lower
tail) [302]. For illustrative purposes, a two-sided test is used as this is what was used in
practice. Further discussion of one-sided and two-sided tests can be found in Montgomery
and Runger [302].
Figure IV-4: Example p-value computation from test statistics using an F-distribution.

Figure IV-5 presents an algorithm for feature relevance ranking using a one-way
ANOVA F-test. Here, both test statistics and p-values are computed for each feature of
the training data with respect to a corresponding class vector, since [349-351] employed
test statistics, and not the p-values, for feature relevance ranking.


Algorithm 2 F-Test Feature Selection Algorithm

for each feature i = 1 → N_F do
    x_i = observations of feature (variable) i across all classes
    y = vector of class identification
    F(i) = MS_Model / MS_Error for x_i grouped by y
    p(i) = p(F(i), DoF)
end for

Figure  IV-5:  One  way  ANOVA  F-test  Feature  Relevance  Ranking  Algorithm. 
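The test statistic portion of Algorithm 2 can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation used for the reported results; the function names `f_statistic` and `rank_features` and the toy data are assumptions introduced here. The p-value step is omitted because it requires the F-distribution of (4.8).

```python
from statistics import fmean

def f_statistic(values, labels):
    """One-way ANOVA F-statistic (MS_Model / MSE) for a single feature."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = fmean(values)
    group_means = {label: fmean(g) for label, g in groups.items()}
    # Between-class (model) and within-class (error) sums of squares.
    ss_model = sum(len(g) * (group_means[c] - grand_mean) ** 2
                   for c, g in groups.items())
    ss_error = sum((x - group_means[c]) ** 2
                   for c, g in groups.items() for x in g)
    dof_model = len(groups) - 1               # u = N_C - 1
    dof_error = len(values) - dof_model - 1   # v = N - u - 1
    if ss_error == 0:
        return float("inf")
    return (ss_model / dof_model) / (ss_error / dof_error)

def rank_features(feature_columns, labels):
    """Return feature indices ordered by decreasing F-statistic, and the scores."""
    scores = [f_statistic(col, labels) for col in feature_columns]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy data (hypothetical): three features, six observations, two classes.
labels = ["A", "A", "A", "B", "B", "B"]
features = [
    [1.0, 1.1, 0.9, 1.0, 1.2, 0.8],  # no class separation
    [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],  # strong class separation
    [1.0, 1.5, 0.5, 2.0, 2.5, 1.5],  # moderate class separation
]
order, scores = rank_features(features, labels)
print(order, scores)
```

In this toy case the strongly separated feature receives the largest F-statistic and is ranked first, which is the behavior the ranking relies on.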


Figure IV-6 and Figure IV-7 present the test statistic and p-values, respectively, at
SNR = 10 dB after employing Algorithm 2 on the ZigBee RF-DNA data. Consistent with
the KS-test, smaller p-values in Figure IV-7 are again considered as more relevant.
Comparing Figure IV-6 and Figure IV-7, one can see that both test statistics and
p-values indicate that phase features are more relevant; however, one can also see that the
p-values trend towards zero while test statistic values do not.
Figure IV-6: Test statistic values from F-test on ZigBee training observations using
a full-dimensional (N_F = 729) feature set at SNR = 10 dB. Higher values indicate
greater feature difference and potentially greater relevance.
Figure IV-7: P-values from F-test on ZigBee training observations using a
full-dimensional (N_F = 729) feature set at SNR = 10 dB. Lower values indicate greater
feature difference and potentially greater relevance.


4.2.1.3 Test Statistic versus P-values for Feature Relevance Ranking

Test statistic values are commonly converted to p-values (probabilities) to assess
significance [302]. P-values are generally considered as the smallest level at which an
observed test statistic value is significant [356]. However, the appropriate use and the
general appropriateness of p-values in statistics are associated with much debate. This is
inherently related to the meaning of a p-value [357]. For feature relevance ranking,
various studies consider p-values, c.f. [89, 113, 121, 358-361], and many backward and
forward selection methods employ p-values for feature selection [362, 363]. KS-test
p-values were also used by Wendt et al. [364] to compare similarities of distributions
between different foundries. However, others advocate the use of the test statistic itself
[349, 351, 365].

Due to this disagreement in the literature, an understanding of the use of p-values and
test statistic values is needed. To facilitate this, a philosophical understanding of p-values
and test statistics is first formulated, followed by a short description of the relative steps
required to compute KS-test and F-test p-values, and then by an empirical understanding of
p-values and test statistic values for DRA.

(a)  General  Understanding  of  P-value  Use  and  Misuse 

Essentially, a p-value is a reflection of a computed test statistic value given a
probability distribution and for a specific null hypothesis [366]. When computed, the
p-values indicate the probability of observing a given result given the reference distribution
and the specified null hypothesis [367, 368]. Hence a p-value is only meaningful in the
context of a given scenario [369], and to compute any p-value one necessarily needs the
following quantities: a hypothesis test, data degrees of freedom, a reference probability
distribution, a test statistic result, and a hierarchy of possible outcomes [367]. However,
these are not always stated in feature relevance ranking applications, c.f. [89, 113, 121,
241, 370], and thus resultant p-value results are often presented out of context.

While test statistic values and p-values largely move in opposite directions
(smaller p-values indicate larger test statistic values), the mapping is rarely linear and is
associated with various properties of the reference distribution. Test statistics are often
ratios of data dependent quantities, while p-values refer to the probability of getting that
value, which involves assumptions with respect to a distribution. Various issues therefore
exist when using p-values for feature relevance ranking, as noted by Cord et al. [365].
When interpreting p-values, differences in p-values can result from differences in effect
sizes and/or differences in standard errors [371], and thus using p-values as a quantifiable
value is considered a logical fallacy of the transposed conditional [372]. P-values are
additionally viewed as imprecise, and debate exists on whether approximate p-values are
more useful than exact values [373].

Additionally, using p-values for feature relevance ranking appears akin to issues
mentioned in Anderson et al. [374], where p-value magnitudes were shown to offer
possibly erroneous interpretations of effect size. Other problems exist in that small
p-values can be computed due to either low variability or large sample sizes [374]. For
example, Kitbumrungrat [375] considered MDA as a classifier and presented feature
relevance ranking values for an MDA-based DRA method, F-test, and p-values; while the
p-values were all essentially equal, the other methods presented different relevance ranking
values for each feature.

The larger question also exists on whether p-values are appropriate for feature
relevance ranking; this particularly revolves around the issue of treating p-values as
exact quantities when p-values of similar magnitude are essentially equivalent [369]. While
one can point to many feature selection methods, such as forward/backward/stepwise
regression, as using p-values for feature selection [354], using p-values for feature
relevance ranking is not without controversy, c.f. [365, 376].




Some disagreement also exists in the statistics literature on whether it is appropriate to
even use p-values for traditional hypothesis testing purposes, e.g. [357, 369, 377-390], with
some journals even refusing to publish p-values from hypothesis tests, e.g. Epidemiology
[391] and Basic and Applied Social Psychology (BASP) [377]. While some of this debate
involves disagreements between Bayesian and Frequentist statisticians [392], further issues
involve the incorrect application of p-values, as Senn [393] stated, “p-values are a
practical success, but a critical failure,” and issues relating to sample-to-sample p-value
variability and the influence of sample size [369].

Summation and many other methods used to combine p-values may present some
difficulties due to an implicit assumption that p-values are the result of independent tests.
How to properly combine p-values is another issue, and a variety of methods for differing
conditions therefore exist, c.f. [394-403]. However, in prior RF-DNA applications, c.f.
[89, 113, 121], summed p-values were not directly interpreted as probabilities, thus the
chance for misinterpretation may not exist. Although many of the steps listed in
Sections 4.2.1.1 and 4.2.1.2 to compute either KS-test or F-test p-values are automated,
these are implicit steps that cannot be ignored when employing a process. Additionally,
by considering the steps needed to compute their respective p-values, we can
conceptualize the issues that exist in p-value feature relevance ranking in the KS-test and
F-test.

In summary, the various issues related to p-values for DRA include:

1. Resolution is lost in the (typically) nonlinear mapping from the test statistic
to the p-value.

2. P-values are imprecise [373].

3. Computing p-values involves an implicit distributional assumption,
whereas test statistics are often only ratios.

4. P-values frequently converge to zero for large quantities of samples
[369].

5. An additional and unnecessary computation is required in looking up the
associated p-value for a given test statistic, hypothesis test, degrees of
freedom, and distribution.

6. Fundamentally, p-values indicate statistical significance, but nothing about
the magnitude of that statistical significance [404-406].

7. Prior to computing test statistic values, one is not making an explicit
distributional assumption, but one must make a distributional assumption
when computing a p-value. As an example, the experimentally computed
F-test statistic value in (4.7) is merely a ratio of sums of squares. While
terming (4.7) an “F-test statistic” does imply an F-distribution, until one
formalizes a hypothesis test and computes the p-values, no distributional
assumption has been made, since there are no distributional assumptions
with general linear models prior to these steps [407]. Therefore, test
statistic values are generally ratios, but do not indicate any underlying
inferences, or significance, of these values until they are tied to a
hypothesis test and reference distribution.
(b) P-value Versus Test Statistic Feature Relevance Ranking

Beyond literature references regarding p-values, it is useful to empirically
evaluate the p-values for feature relevance ranking. This is considered below for both the
KS-test and F-test on the ZigBee RF-DNA Fingerprint data, and further in Appendix C
on general academic datasets. As seen in Figure IV-4, the resulting p-value from a given
test statistic involves firstly an additional computational step and secondly a nonlinear
mapping. As one can visualize, the AUC will vary nonlinearly as a given test statistic
varies linearly, inherently making comparison, ranking, and interpretation more
difficult. Additionally, F-test p-values may not offer comparison of features from
multiple datasets since the underlying probability distribution changes as the degrees of
freedom change.

To examine the distributions of the p-values and test statistic values for the F-test
and KS-test, histograms of unit area, using the same bin centers and bin widths, are used.
Figure IV-8 presents summed p-values from the KS-test, while Figure IV-9 presents
mean test statistic values from the KS-test. Four operating points, SNR = [0, 10, 18, 30]
dB, are used in both Figure IV-8 and Figure IV-9. Overall, both Figure IV-8 and Figure
IV-9 illustrate that features become more statistically significant in the KS-test as noise
diminishes, with p-values approaching 0 as the underlying null hypothesis is rejected.
However, conditions exist where all features could be viewed as significant if only
p-value feature ranking were used. For instance, at SNR = 10 dB two features have a
summed p-value equal to exactly 0, and at SNR = 30 dB, 99.7% of the features are in the
first bin (centered at 0.0108) with 12% of the features having a p-value exactly equal to 0
and thus being equivalently relevant. This issue of resolution exists even at SNR = 0 dB,
where a large number of features have very low p-values.

While feature relevance resolution was lost when using p-values, as seen in
Figure IV-8, resolution is not lost when using test statistic values, as seen in Figure IV-9.
The result in Figure IV-9 thus illustrates that KS-test statistic values offer a more refined
and consistent approach for finding and selecting features, one which is not overwhelmed by
the numerous p-value issues described in Section 4.2.1.3 and visualized in Figure IV-8.
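The loss of resolution can be reproduced in a small sketch. The code below assumes the textbook two-sample KS statistic and the asymptotic Kolmogorov series for its p-value; the function names, the sample size of 1200 observations per class, and the chosen statistic values are hypothetical and are not taken from the RF-DNA data or its processing chain.

```python
import math

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both empirical CDFs past every sample equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sample KS p-value from the Kolmogorov series."""
    x = math.sqrt(n * m / (n + m)) * d
    if x < 0.3:
        return 1.0  # the series converges poorly here; the true value is ~1
    q = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * (k * x) ** 2)
                  for k in range(1, terms + 1))
    return min(max(q, 0.0), 1.0)

# Hypothetical sample sizes and KS statistic values for four features.
n = m = 1200
for d in (0.02, 0.3, 0.9, 1.0):
    print(d, ks_pvalue(d, n, m))
```

Under these assumptions, statistic values of 0.9 and 1.0 remain distinguishable, while both of their p-values underflow to exactly 0 in double precision, so a p-value ranking cannot order the two features.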


Figure IV-8: Normalized histogram of summed pairwise KS-test p-values using a
full-dimensional (N_F = 729) ZigBee TNG feature set for varying SNR = [0, 10, 18, 30] dB.
Figure IV-9: Normalized histograms of mean pairwise KS-test statistic values using
a full-dimensional (N_F = 729) ZigBee TNG feature set for SNR = [0, 10, 18, 30] dB.


Figure IV-10 and Figure IV-11 consider the F-test p-values and test statistic
values, respectively, through normalized histograms and the same bin widths as in Figure
IV-8. Figure IV-10 and Figure IV-11 show a similar distributional issue for F-test
p-values, where p-values are converging on 0 whereas the F-test statistic values do not
converge to any one number.


Figure IV-10: Normalized histogram of F-test p-values using a full-dimensional
(N_F = 729) ZigBee TNG feature set for varying SNR = [0, 10, 18, 30] dB.

Figure IV-11: Normalized histograms of F-test statistic values using a
full-dimensional (N_F = 729) ZigBee TNG feature set for SNR = [0, 10, 18, 30] dB.

Table IV-1 condenses the results of Figure IV-8 and Figure IV-10 by illustrating
that p-values trend towards 0, or indistinguishable numbers, as SNR increases. The
general estimated decimal relative spacing between values of 2.22×10^-16, per [408], was
used as the threshold for this computation. Table IV-1 thus indicates that increasing signal
strength corresponds to increasing significance. This result further mirrors that of p-values
trending towards 0 in [365].
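The counting step behind this kind of summary can be sketched as follows, assuming IEEE 754 double precision, where Python exposes the relative spacing as `sys.float_info.epsilon`. The function name `count_indistinguishable` and the short list of p-values are illustrative assumptions (the values are of the magnitudes reported in this section), not the original implementation.

```python
import sys

# Relative spacing of IEEE 754 doubles (about 2.22e-16).
EPS = sys.float_info.epsilon

def count_indistinguishable(p_values, threshold=EPS):
    """Count p-values at or below the relative spacing threshold."""
    return sum(1 for p in p_values if p <= threshold)

# Illustrative p-values spanning the magnitudes discussed in the text.
p_values = [1.22e-303, 1.29e-268, 0.0, 3.71e-94, 0.839, 0.988]
n_tiny = count_indistinguishable(p_values)
print(n_tiny, "of", len(p_values), "p-values are at or below", EPS)
```

Every value counted this way is treated as notionally equal to 0, which is why a ranking based on such p-values cannot separate them.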
Table IV-1: Quantity of ZigBee p-values Less Than or Equal to 64-bit Relative
Spacing, from [49].

Method                      SNR = 0 dB   SNR = 10 dB   SNR = 18 dB   SNR = 30 dB
F-test p-values                     12           328           573           635
KS-test summed p-values              0           122           397           679

Table IV-2, adapted from Bihl et al. [49], further examines p-values and test
statistics for the top 5 and bottom 2 ranked ZigBee RF-DNA features (ranked by respective
test statistic value) at SNR = 10 dB. Values in Table IV-2 are ranked by respective test
statistic values for both the F-test and KS-test, with the corresponding p-values. The 728th
and 729th, lowest ranked, values illustrate the scale of the values. While machine
precision values are a continuum which rarely converge to any single number, noticeably
many p-values are below the decimal relative spacing of 2.22×10^-16 [408], and are thus
notionally equivalent and equal to 0 for computing mean and variance. Evident in Table
IV-2 is that ranking values equivalent to 0 may not provide a consistent means for
ranking features and could be less effective when selecting a low number of features.
Table IV-2: P-values vs Test Statistic for ZigBee at SNR = 10 dB Ordered by
Decreasing F-test and KS-Test Statistic Value, adapted from [49]

            F-Test                           KS-Test
Feature     Test Statistic   P-value         Summed Test Statistic   Summed P-value
1           542.64           1.22×10^-303    3.316                   3.71×10^-94
2           471.78           1.29×10^-268    3.251                   0
3           432.97           6.38×10^-249    3.242                   6.39×10^-97
4           424.26           1.88×10^-244    3.169                   9.79×10^-98
5           420.74           1.22×10^-242    3.053                   1.90×10^-61
728         0.280            0.839           0.164                   2.18
729         0.043            0.988           0.150                   2.67
Variance    6,324.8          0.0094          0.2417                  0.0646


Feature selection via p-values therefore has considerable issues. Further issues
are illustrated in Appendix C, where various academic datasets are considered through the
KS-test and F-test DRA methods. For both RF-DNA DRA and the academic datasets in
Appendix C, test statistic values are seen to not converge on any specific number, and
thus they offer a more natural tool for feature comparison than p-values. Employing test
statistic values for DRA is also consistent with the F-statistic DRA method formulated in
[349]. As noted in Section 4.2.1.3(a), computing and interpreting p-values also involves
further issues. Further comparisons of p-values versus test statistic values will be made
via classification and verification performance.
4.2.2 Post-Classification Feature Selection DRA for RF-DNA


Model-based feature selection methods involve computing a feature ranking as a
byproduct or result of a classification model building process. Prior RF-DNA research
has considered only GRLVQI feature relevance ranking and Random Forest as
post-classification DRA methods. Although the MDA classifier has seen much use in
RF-DNA applications, noticeably missing in previously applied DRA methods are
MDA-based DRA methods. This absence is due to the assumption that MDA-based
post-classification DRA was not directly possible [51, 91, 134]. However, various
MDA-based DRA methods do exist in the literature, e.g. [242, 351, 409], and these are
further developed herein for application to RF-DNA. MDA-based feature relevance ranking
methods are considered and described below, including Wilk’s Lambda, which examines
the scatter matrices of MDA; Discriminant Weights, which are the raw eigenvector
coefficients of the MDA matrices; and Discriminant Loadings, the correlation of the
eigenvectors of MDA with the original data.

4.2.2.1 GRLVQI Feature Relevance Ranking

As discussed in Section III, GRLVQI feature relevance scores, ψ, provide a
model-based indication of feature contribution to the GRLVQI classifier development
process [244-246, 266]. Prior work [89, 113] demonstrated ψ values offering
comparable performance to KS-test p-value ranking for ZigBee feature selection with
multiple discriminant analysis (MDA). Figure IV-12 examines GRLVQI relevance
scores, ψ, plotted by feature index number. Consistent with the KS-test and F-test DRA
methods, GRLVQI relevance scores again show phase features as the most relevant.


Figure IV-12: Feature ranking using GRLVQI relevance values using
full-dimensional N_F = 729 ZigBee TNG observations at SNR = 10 dB.

4.2.2.2 MDA Based Feature Selection

Various  methods  of  feature  relevance  ranking  are  implicit  in  MDA  and  can  be 
determined  relatively  simply.  Primarily,  these  methods  involve  ratios  between  scatter 
matrices  and  examining  the  discriminant  functions  themselves.  Three  general  methods 
for  MDA  post-classification  DRA  will  be  considered:  Wilk’s  Lambda,  Discriminant 
Weights,  and  Discriminant  Loadings. 
(a) Wilk’s Lambda


Wilk’s Lambda values are computed via a ratio of determinants of MDA scatter
matrices [409]; therefore this method is considered to be a post-classification DRA
method. Wilk’s Lambda has been used in various MDA applications, e.g. [410, 411], and
is computed as

$$ \Lambda = \frac{\det S_W}{\det S_T}, \tag{4.9} $$

which is a ratio between the determinants of the within and total scatter matrices, with
Λ ∈ [0, 1] [409]. In operation, large values of Λ indicate poor separation between groups,
while smaller values of Λ indicate good separation between groups [409]. Logically,
large group separations lend themselves to improved discrimination; therefore lower
Λ values are associated with more relevant features for classification [409].

The Wilk’s Lambda method is used for DRA by computing each feature’s
Λ values using (4.9). For consistency with other DRA methods, herein Wilk’s Lambda
results are considered as 1 - Λ, to ensure that higher values indicate more relevant
features. Figure IV-13 presents the 1 - Λ values for SNR = 10 dB for ZigBee. Consistent
with the KS-test, F-test, and GRLVQI feature relevance ranking, the phase features
appear most relevant in Figure IV-13.
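For a single feature, the scatter matrices in (4.9) reduce to scalars, so the determinant ratio becomes a ratio of within-class to total sums of squares. The sketch below illustrates the 1 - Λ relevance score under that assumption; the function name `wilks_lambda_1d` and the toy data are hypothetical, and the per-feature scalar form is an interpretation rather than a confirmed detail of the original processing.

```python
from statistics import fmean

def wilks_lambda_1d(values, labels):
    """Per-feature Wilk's Lambda: within-class over total sum of squares."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = fmean(values)
    group_means = {label: fmean(g) for label, g in groups.items()}
    # Scalar "scatter matrices" for one feature.
    s_within = sum((x - group_means[c]) ** 2
                   for c, g in groups.items() for x in g)
    s_total = sum((x - grand_mean) ** 2 for x in values)
    return s_within / s_total

# Toy data (hypothetical): three features, six observations, two classes.
labels = ["A", "A", "A", "B", "B", "B"]
features = [
    [1.0, 1.1, 0.9, 1.0, 1.2, 0.8],  # no class separation
    [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],  # strong class separation
    [1.0, 1.5, 0.5, 2.0, 2.5, 1.5],  # moderate class separation
]
# Report 1 - Lambda so that higher values indicate more relevant features.
relevance = [1.0 - wilks_lambda_1d(col, labels) for col in features]
print(relevance)
```

In this toy case the well-separated feature scores close to 1 and the unseparated feature scores close to 0, matching the interpretation of Λ given above.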


Figure IV-13: Feature ranking values from Wilk’s Lambda ratio using
full-dimensional N_F = 729 ZigBee TNG observations at SNR = 10 dB.


(b)  Discriminant  Weights  and  Group  Means 

One potential MDA-based DRA approach would be to remove features associated
with relatively low eigenvector, or discriminant function, coefficients, as employed in
[412-414]. However, eigenvectors are considered to be generally unsuitable for
providing feature relevance information [237], and this approach is considered imprecise
for this purpose since small values can appear insignificant while actually being
significant from an MDA standpoint [351]. For this reason, discriminant weights
themselves are not considered for DRA. However, the basis of this approach, determining
the connections between discriminant functions and the data features, is similar to the
discriminant loadings methods.

(c)  Discriminant  Loadings 

Discriminant loadings were presented in Section 3.1.1, and are analogous to
principal component loadings in describing how each feature contributes to a given
projection vector [237, 415]. Visually examining MDA loadings is one approach to
interpretation [416]. Figure IV-14 presents discriminant loadings for the N_F = 729 and
N_C = 4 full-dimensional ZigBee TNG fingerprint set with values from (12) for N_Dim = 3
loadings vectors, as determined via (3.10). In Figure IV-14 both positive and negative
MDA loadings values are visible. Also visible is an almost periodic sign change, which
is possibly due to the binning process where adjacent bins could naturally be expected to
have a directionally opposite action [417]. Also of interest is that the phase features
appear to have higher magnitude loading values than amplitude and frequency, which is
consistent with other DRA methods.
Figure IV-14: ZigBee discriminant loadings (L) for the three discriminant functions
using full-dimensional N_F = 729 ZigBee TNG observations at SNR = 10 dB.
Reprinted from [135].


However,  apparent  in  Figure  IV-14  is  that  each  discriminant  function  presents 
different  loading  values  for  each  fingerprint  feature.  Necessary  in  DRA  is  ranking  each 
fingerprint  feature  with  a  single  value  and  it  is  not  readily  apparent  how  to  rank  multiple 
loadings  values  for  each  feature.  Therefore  algorithmic  fusion  methods  will  be 
considered  to  develop  an  MDA  loadings  ranking  method. 
4.2.3 Algorithmic Fusion Methods

With multiple competing DRA methods used for feature selection, the
combination of methods could be of interest. Fusion, in the signal processing sense,
involves the combination of data, data features, or decisions from data for a combined
result [418]. Fusion extends from Aristophanes’ concept of Φροντιστήριο, or
phrontisterion, the ‘think tank’ [419, p. 162; 420]. Of interest herein is ‘fusing’ various
feature selection algorithms in an attempt to gain confidence in the features that are
retained. To pursue this aim, a general review on fusion is needed. Figure IV-15 presents
the three general types of fusion: data, feature, and decision. In general:

1. Data Level Fusion - combines the data from different sources; examples
include combining a hyperspectral image pixel vector with the
corresponding SAR intensity of that point [421] and combining different
medical test values (e.g. blood sugar, enzymes, etc.).

2. Feature Level Fusion - combines the extracted features in some manner to
be input to a classifier/detector/etc.; a few examples would include
examining PCA vectors from two different data sources in an ANN as
ANN inputs, or the addition of the patient’s address to the medical test
values (in the above example).

3. Decision Level Fusion - combines the decisions of multiple processes to
create a combined decision. A few examples of this would be 1) applying
multiple statistical classifiers to the same problem and then combining
their results to create a final score, 2) including multiple doctors in a
patient’s diagnosis, and 3) combining a human interpretation of data with a
computer decision (which might also be a fusion of multiple statistical
classifiers).
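As one concrete instance of decision level fusion, the sketch below combines the class labels produced by several classifiers using a plurality (majority) vote. Majority voting is only one simple combining rule chosen here for illustration; the function name `majority_vote`, the tie-breaking rule, and the device labels are assumptions and are not drawn from the fusion architectures in [418].

```python
from collections import Counter

def majority_vote(decisions):
    """Fuse per-classifier class labels into one decision by plurality."""
    counts = Counter(decisions)
    best = max(counts.values())
    # Break ties deterministically by first appearance in the input order.
    for label in decisions:
        if counts[label] == best:
            return label

# Hypothetical decisions from three classifiers for a single observation.
votes = ["Device2", "Device1", "Device2"]
fused = majority_vote(votes)
print(fused)
```

Other combining rules, such as weighting each vote by classifier accuracy, fit the same pattern by replacing the counting step.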

Additionally, variants on the architectures presented in Figure IV-15 can exist; for
instance, Zhao et al. [422] created a combined feature-decision fusion approach with
different feature subsets used for each classifier. The architecture of Zhao et al. [422] is
therefore also a form of series fusion. Generally, either diversity and/or accuracy are
used as measures for combining classifiers [423]. Recent results have indicated that
classification accuracy consistently outperforms diversity when combining classifiers [423].
Figure IV-15: Three General Fusion Method Architectures, adapted from [418].


4.2.3.1 MDA Loadings Fusion (MLF)

As apparent in Figure IV-14, interpretation of MDA loadings into actionable
feature rankings is non-trivial. Perreault et al. [424] introduced a composite Potency
index,

$$ L_{Pot} = L^2 \left( \frac{\lambda_i}{\sum_{i=1}^{N_{Dim}} \lambda_i} \right), \tag{4.10} $$




which both squares each loading value, to remove interpretation issues associated with the
direction of the loading considered, and scales each squared loading value by the relative
eigenvalue. Conceptually, the Potency index is a form of MDA Loadings Fusion (MLF),
where loadings are fused through various methods to compute a final score. Although
the Potency index has seen use in various MDA-based DRA applications, e.g. [425-432],
variations of this concept have not been explored. The Potency index and MLF methods
are also conceptually similar to the weighted principal component approach of [433];
however, Kim and Rattakorn [433] considered variance explained and employed a
moving range for selecting an appropriate level of dimensionality.

The following MLF strategies are therefore considered: first, unscaled MLF,
where each loading for each feature will be considered as having an equal vote; second,
scaled MLF, where each loading will be scaled by its relative weight as determined by
the eigenvalues.

(a) Unscaled MLF

Thus,  the  following  methodology  was  developed  to  create  a  single  score  for  each 
fingerprint  feature: 

1. Compute the absolute value of all loadings vectors.

2.  Apply  a  fusion  method  (maximum  or  sum)  to  create  a  single  vector  for 
ranking  features. 

Two fusion methods were considered for Step 2: 1) an Unscaled Maximum
(UMax) score representing the maximum loading for each feature and 2) an Unscaled
Sum (USum) score representing the summation of loading values for each feature. The
USum score is computed by summing the loadings, L (after the absolute value of Step 1),
across the columns; for the ith feature this is computed as

$$ L_{USum,i} = \sum_{j=1}^{N_{DoF}} L_{i,j}. \tag{4.11} $$

Similarly, the UMax score is computed by finding the maximum value of the loadings, L,
across the columns; for the ith feature this is computed as

$$ L_{UMax,i} = \max(L_i). \tag{4.12} $$

Results presented in Figure IV-16 display the UMax MDA loadings scores, which show
that phase features are again the most relevant for classifier model development.
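The two unscaled fusion rules can be sketched as below. This is a minimal illustration assuming the loadings are held as one row per feature and one column per discriminant function; the function name `unscaled_mlf` and the small loadings matrix are hypothetical and are not ZigBee results.

```python
def unscaled_mlf(loadings):
    """Return (USum, UMax) scores; one row per feature, one column per function."""
    # Step 1: absolute value of all loadings.
    abs_rows = [[abs(x) for x in row] for row in loadings]
    # Step 2: fuse across discriminant functions by sum or by maximum.
    u_sum = [sum(row) for row in abs_rows]
    u_max = [max(row) for row in abs_rows]
    return u_sum, u_max

def rank(scores):
    """Feature indices ordered from most to least relevant."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical loadings: three features, three discriminant functions.
L = [
    [0.10, -0.80, 0.05],
    [-0.40, 0.30, 0.35],
    [0.02, 0.01, -0.03],
]
u_sum, u_max = unscaled_mlf(L)
rank_sum, rank_max = rank(u_sum), rank(u_max)
print(rank_sum, rank_max)
```

In this example the two rules disagree on the top feature: the sum favors a feature with moderate loadings on every function, while the maximum favors a feature with one dominant loading.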


Figure IV-16: Feature ranking values from Unscaled Maximum (UMax)
discriminant loadings, plotted by feature index number, using full-dimensional N_F = 729
ZigBee TNG observations at SNR = 10 dB.

(b)  Scaled  MLF 

While the unscaled MDA loadings presented in Figure IV-16 reflect overall how each feature is correlated to a given discriminant function, they ignore additional information contained in the eigenvalues. Therefore, a further MLF method, involving scaling the MDA loadings by their respective eigenvalues, is a logical extension to account for the contribution that each discriminant function gives to total variance.

The loadings signify how each data feature is correlated to a given discriminant function. Because discriminant functions are also weighted by eigenvalue, it is not directly intuitive how to use them for feature selection. The method proposed involves averaging the discriminant loadings after scaling them by their eigenvalue’s contribution to total variance explained. This is computed as


    L_{S;i,j} = |L_{i,j}| \frac{\lambda_j}{\sum_{k=1}^{N_{DoF}} \lambda_k}    (4.13)


which  is  very  similar  to  the  Potency  index  of  [424]  and  (4.10),  but  avoids  the  squared 
loadings  of  (4.10)  which  shrink  the  overall  MDA  loadings  magnitude. 

This  method  enables  the  discriminant  loadings  to  be  ranked  by  the  eigenvalue  of 
each  discriminant  function  and  by  the  contribution  of  each  feature  to  each  discriminant 
function. 

The  following  general  methodology  was  used  for  Scaled  MLF  and  is  further 
described  in  [417]: 

1. Compute the absolute value of all loadings vectors,

2. Multiply each absolute value loadings vector by the appropriate eigenvalue-based weight per (4.13),

3. Apply a fusion method (maximum or sum) to create one vector for ranking features.


Consistent with Unscaled MLF, two fusion methods are considered for Step 3: 1) a Scaled Maximum (SMax) score, and 2) a Scaled Sum (SSum) score. The SSum score is computed by summing the scaled loadings, L_S, across the columns; for the ith feature this is computed as

    L_{SSum,i} = \sum_{j=1}^{N_{DoF}} L_{S;i,j}.    (4.14)

Similarly, the SMax score is computed by finding the maximum value of the scaled loadings, L_S, across the columns; for the ith feature this is computed as

    L_{SMax,i} = \max_{j} (L_{S;i,j}).    (4.15)

Figure IV-17 presents a series of scatterplots to show the general relationship between UMax, USum, SMax, and SSum for the full-dimensional NF = 729 feature set at SNR = 10 dB. As presented in [417], Figure IV-17 shows that the four fusion methods appear to largely provide different results, with two exceptions: 1) UMax and USum are correlated, and 2) SMax and SSum are highly correlated. However, all four methods are further considered since small differences between methods can result in different DRA subsets and thus different results.
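The scaled MLF steps can be sketched in the same way. The following is a hedged illustration under stated assumptions, not the code used in this work: the function name, the loadings matrix, and the eigenvalues are hypothetical, and each column of the loadings is assumed to correspond to the eigenvalue at the same position.

```python
def scaled_mlf(loadings, eigenvalues):
    """Return (smax, ssum) score lists, one score per feature.

    Each absolute loading is weighted by its discriminant function's share
    of the eigenvalue total, then fused by maximum or by sum.
    """
    total = sum(eigenvalues)
    weights = [ev / total for ev in eigenvalues]  # share of each discriminant function
    scaled = [[abs(v) * w for v, w in zip(row, weights)] for row in loadings]
    smax = [max(row) for row in scaled]           # maximum fusion
    ssum = [sum(row) for row in scaled]           # sum fusion
    return smax, ssum


# Hypothetical loadings (rows = features) and eigenvalues (illustrative only).
L = [[0.10, -0.60],
     [0.40, 0.35],
     [-0.20, 0.05]]
eigenvalues = [3.0, 1.0]
smax, ssum = scaled_mlf(L, eigenvalues)
```

With these illustrative numbers the second feature, which loads on the dominant discriminant function, receives the largest SMax and SSum scores, even though the first feature has the largest raw loading.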


[Plot omitted; panel axes: Unscaled Max, Unscaled Sum, Scaled Max, Scaled Sum.]

Figure IV-17: Matrix scatterplots for four MDA Loadings Fusion (MLF) methods, Unscaled (UMax and USum) and Scaled (SMax and SSum), using full-dimensional NF = 729 feature set at SNR = 10 dB. Reprinted from [135].


4.2.4  DRA  Fusion  Methods 

Herein, post-classification feature extraction, termed “DRA fusion,” is considered as an extension of decision fusion. Three DRA fusion methods are developed: rank-based DRA fusion, score-based DRA fusion, and concatenation DRA fusion.

4.2.4.1 Rank and Score Based Fusion

Rank and score based fusion extend series fusion by considering the DRA ranking scores for each feature. Both methods operate similarly and are conceptualized in Figure IV-18. Step 1 in Figure IV-18 considers the ranks or normalized scores for each method; in Step 2 these are fused via summation and a new feature relevance ranking vector is computed.


[Diagram omitted; Step 1: ranks or normalized scores from each method are summed and re-sorted; Step 2: fused feature relevance ranking vector.]

Figure IV-18: Generic Example of Score and Rank Fusion


(a)  Score  Based  DRA  Fusion 

Score-based DRA fusion first normalizes the disparate DRA feature selection scales to a common scale via min-max data normalization,

    X_{min-max} = \frac{X - X_{min}}{X_{max} - X_{min}},    (4.16)

where X is the original data, X_{min-max} is the scaled data, X_{min} is the minimum value, and X_{max} is the maximum value; this places values on a [0, 1] interval [434]. Although min-max normalization is sensitive to outliers [434], it is both a very common approach and places scores on an advantageous [0, 1] interval. Following normalization, scores from DRA methods are summed and then a new feature relevance ranking vector is computed from the fused scores.
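A minimal sketch of score-based DRA fusion, assuming every method's score list is oriented so that larger values mean more relevant features (scores such as p-values would need to be re-oriented first). The function names and example scores are hypothetical.

```python
def min_max(scores):
    """Min-max normalize a list of scores onto [0, 1], per (4.16)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate case (all scores equal); returning zeros is an assumption.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def score_fusion(score_lists):
    """Sum normalized scores across methods and re-sort the features.

    Returns (fused_scores, order), where `order` lists feature indices
    from most to least relevant.
    """
    normalized = [min_max(scores) for scores in score_lists]
    fused = [sum(column) for column in zip(*normalized)]
    order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return fused, order


# Two hypothetical DRA methods scoring the same three features on different scales.
method_a = [2.0, 10.0, 6.0]
method_b = [0.3, 0.1, 0.5]
fused, order = score_fusion([method_a, method_b])
```

Here the third feature ranks first after fusion because it scores reasonably well under both methods, even though it is the top feature for only one of them.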
(b) Rank Based DRA Fusion

Dichotomization involves converting a continuous variable into a discrete variable. An example of doing so would be converting continuous relevance scores into a ranked list, as described by [435]. Rank-based DRA fusion first considers the ordered ranking of each DRA method under consideration; these ranks are summed and a resulting summed rank vector is computed. The ordered rank of the summed rank vector is then used to determine feature relevance ranking. Thus rank-based DRA fusion is similar to score-based DRA fusion with the exception that the raw scores are not considered.

However, employing ranks may not be advantageous due to dichotomization issues. It is generally recommended to use continuous data, when available, rather than categorical data [436-441]. However, one encounters ranked lists in various feature relevance ranking operations, and for RF-DNA, rank-based DRA fusion avoids issues with score normalization; therefore, the possibility of fusing results based on rank is considered.
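Rank-based DRA fusion can be sketched as follows. This is an illustrative example only: the function names and scores are hypothetical, rank 1 is assumed to denote the most relevant feature, and ties are broken by feature index, which is a choice the text does not specify.

```python
def to_ranks(scores):
    """Convert scores to ranks, where rank 1 is the largest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, feature in enumerate(order, start=1):
        ranks[feature] = position
    return ranks


def rank_fusion(score_lists):
    """Sum per-method ranks and order features by the summed rank.

    Returns (summed_ranks, order); the smallest summed rank comes first.
    """
    rank_lists = [to_ranks(scores) for scores in score_lists]
    summed = [sum(column) for column in zip(*rank_lists)]
    order = sorted(range(len(summed)), key=lambda i: summed[i])
    return summed, order


# Two hypothetical DRA methods scoring the same four features.
method_a = [2.0, 10.0, 6.0, 1.0]
method_b = [0.3, 0.1, 0.5, 0.05]
summed, order = rank_fusion([method_a, method_b])
```

Because only the ordering within each method is used, no score normalization is needed, at the cost of discarding how far apart the raw scores were.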

4.2.4.2 Concatenation Fusion

Rank and score feature relevance ranking fusion seek to fuse the overall score of multiple feature relevance ranking methods. Concatenation fusion involves concatenating two or more vectors to form a single vector and has seen application in a variety of fields, cf. [442-456]. Herein, an approach similar to that of Kekre et al. [457] is developed, where the selected features are appended to each other. However, care must be taken in this process, as multiple identical features will at a minimum add redundant features and necessarily introduce multicollinearity problems. Multicollinearity issues violate assumptions of MDA and other linear classifiers; therefore, adding only unique features is necessary in feature selection fusion. Such a problem was not a concern for Kekre et al. [457], since they were fusing Red, Green, and Blue pixel information and hence were not concerned with uniqueness.

The RF-DNA concatenation DRA fusion method is conceptualized in Figure IV-19. Here, a user selects the desired total NDRA and the NDRA/method top ranked features are proportionally taken from each DRA method,

    N_{DRA/method} = round\left( \frac{N_{DRA}}{N_{methods}} \right),    (4.17)

where N_{methods} is the number of DRA methods to be fused. The process in Figure IV-19 then removes repeated features to avoid singularity issues. The process then adds one next highest ranked feature from each DRA method and iterates until the fused vector has NDRA features.
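The concatenation process described above can be sketched as follows. This is a hedged sketch, not the implementation used in this work: the function name and rankings are hypothetical, "next highest ranked feature" is interpreted as the next position in each method's ranking, and Python's built-in `round` (which rounds exact halves to the nearest even integer) stands in for the rounding in (4.17).

```python
def concatenation_fusion(rankings, n_dra):
    """Fuse ranked feature lists by concatenation without repeats.

    `rankings` holds one list per DRA method, each ordered from most to
    least relevant feature index. Returns at most `n_dra` unique features.
    """
    per_method = round(n_dra / len(rankings))  # (4.17); see rounding note above
    fused, seen = [], set()

    def add(feature):
        # Skip repeated features and never exceed the requested size.
        if feature not in seen and len(fused) < n_dra:
            seen.add(feature)
            fused.append(feature)

    # Step 1 and Step 2: append the top features from each method, dropping repeats.
    for ranking in rankings:
        for feature in ranking[:per_method]:
            add(feature)

    # Step 3 and Step 4: add one further feature per method until n_dra is reached.
    depth = per_method
    max_depth = max(len(ranking) for ranking in rankings)
    while len(fused) < n_dra and depth < max_depth:
        for ranking in rankings:
            if depth < len(ranking):
                add(ranking[depth])
        depth += 1
    return fused


# Two hypothetical rankings of feature indices.
method_a = [5, 2, 9, 1, 7]
method_b = [2, 5, 3, 8, 1]
fused = concatenation_fusion([method_a, method_b], n_dra=4)
```

In this example both methods agree on their top two features, so the initial concatenation yields only two unique features and one further iteration is needed to reach the requested four.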


[Flowchart omitted; steps: 1. Append top round(k/p) features from p DRA methods; 2. Remove repeated features; 3. Append 1 feature from each DRA method and remove repeated features; 4. Iterate until the required number of features is reached.]

Figure IV-19: General Process for Concatenation Fusion


4.2.5  Random  Feature  Selection 


When considering RF-DNA data, where there are hundreds of features, one could logically posit that any randomly selected and sufficiently large set of features could perform adequately. Since the ZigBee and Z-Wave RF-DNA datasets have no known corrupt features, it is logical to believe that any random subset of features would offer some discriminating ability.

To account for this possibility, a random feature selection approach is considered to provide a lower bound for performance. For ZigBee, the random feature selection approach draws uniform random feature relevance ranking values, U(0,1), for the NF = 729 feature set. An implicit assumption that higher magnitude random ranking values are more relevant was used to select NDRA feature sets. Since one random set of rankings may produce good results, replications are used and then classification and verification accuracies are averaged for the replicates. Performance from the random feature selection therefore offers a minimum expected level of performance for a given NDRA.
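The random baseline can be sketched as below. This is an illustration under stated assumptions: `evaluate` is a hypothetical callback standing in for whatever classifier assessment is run on a feature subset, and the function names, seed, and number of replicates are not taken from this work.

```python
import random


def random_feature_subset(n_features, n_dra, rng):
    """Draw U(0,1) relevance values and keep the n_dra largest."""
    values = [rng.random() for _ in range(n_features)]
    order = sorted(range(n_features), key=lambda i: values[i], reverse=True)
    return order[:n_dra]


def mean_accuracy_over_replicates(evaluate, n_features, n_dra, n_reps, seed=0):
    """Average a user-supplied accuracy measure over random replicates."""
    rng = random.Random(seed)
    accuracies = [
        evaluate(random_feature_subset(n_features, n_dra, rng))
        for _ in range(n_reps)
    ]
    return sum(accuracies) / len(accuracies)


# One replicate for a 729-feature set, keeping 50 features.
subset = random_feature_subset(729, 50, random.Random(1))
```

Averaging over replicates guards against a single lucky (or unlucky) draw being mistaken for the expected baseline.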

4.2.6  Dimensionality  Assessment 

With relevance ranked features, DRA next involves selecting an appropriate level of dimensionality. Both qualitative and quantitative DRA dimensionality assessment methods are possible. Prior RF-DNA DRA research, e.g. [89, 113, 121], examined qualitative DRA for RF-DNA fingerprint features; however, these were based on subjective assessments which may not be precise. Herein, quantitative DRA approaches to estimate the intrinsic dimensionality in the data are developed. As noted by Jain et al. [213], an optimal approach to selecting features is via exhaustively examining classifier results produced from all possible combinations of features. However, this is very computationally intensive (and was noted as such by Jain et al. [213]) and is not practical for large datasets such as the ZigBee RF-DNA data where NF = 729. Therefore, quantitative DRA approaches that examine intrinsic dimensionality of the data are developed and considered.

4.2.6.1 Qualitative Dimensionality Assessment

Prior RF-DNA work, cf. [89, 113, 121], examined qualitative DRA methods for RF-DNA where subjective operator experience was used to select NDRA. This was partially due to having no explicit selection criteria for selecting NDRA based on KS-test p-value or GRLVQI relevance values. To determine an appropriate number of ranked features to retain, Dubendorfer et al. [113] examined various qualitative operating points corresponding to

    N_{DRA} = [25, 50, 100, 200, 243]    (4.18)

feature sets. These were evaluated using an MDA/ML classifier, with the conclusion that NDRA = 50 features (selected using either KS-test p-values or GRLVQI relevance values) offered sufficient classification performance. However, this quantity or proportion (50/729, or 6.86% of the available features) is not necessarily generalizable to other RF-DNA fingerprint datasets and applications. Additionally, it is not known how to systematically search for these quantities. Therefore, creating quantitative approaches based on the data itself is of particular interest.

4.2.6.2 Quantitative Dimensionality Assessment

Various quantitative dimensionality selection methods exist based on data covariance and correlation matrix responses [458-461]. Additionally, heuristics exist based on p-value significance and MDA-loadings magnitudes [358]. Of interest is developing quantitative dimensionality assessment methods for RF-DNA applications through data covariance and correlation matrices, p-values, and MDA-loadings.

(a)  Heuristic-based  Approaches  on  Discriminant  Loadings 

Discriminant loading magnitudes can also be used to estimate an appropriate number of features to retain. Various publications, cf. [462-464], suggested that discriminant loadings magnitudes greater than 0.30 indicate a feature is significant. Given that these works did not address scaled loadings, the heuristic value of 0.30 was applied to Unscaled Max scores at SNR = 10 dB and yielded NDRA = 51 as the number of loadings greater than 0.30 in each composite. Because NDRA = 51 is approximately equal to the NDRA = 50 determined by [113], this lends credence to the qualitative method of [113], and thus only NDRA = 50 will be further examined for consistency with prior work.

(b)  P-value  based  Approaches 

Another approach to DRA assessment involves selecting NDRA from p-value significance [358]. As described in Section (b), p-values tend to zero for RF-DNA fingerprints, and thus employing a p-value threshold for quantitative DRA could involve retaining a majority of the data. For instance, at SNR = 10 dB, if one employed a p-value threshold of 5%, a common statistical significance threshold, one would retain NDRA = 674 if using the F-test or NDRA = 512 if using the KS-test.

Table IV-3 further presents the quantity of retained features using the F-test and KS-test at SNR = [0, 10, 18, 30] dB for different statistical significance levels. Statistical significance levels of [0.1%, 1%, 5%, 10%] are employed as commonly used [465], although largely arbitrary [379], statistical thresholds. Comparing Table IV-3 with the results of [121] indicates that p-value DRA assessment heavily over-estimates the number of features to retain, since phase (φ) features, NF = 243 herein, are known to offer performance comparable to the baseline. Therefore, p-value dimensionality assessment does not appear appropriate and is not considered further for ZigBee RF-DNA data.
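The counting behind a p-value based assessment can be sketched as follows. This is illustrative only: the function name and p-values are hypothetical, and a feature is assumed to be retained when its p-value is strictly below the significance level.

```python
def retained_by_significance(p_values, levels=(0.001, 0.01, 0.05, 0.10)):
    """Count the features whose p-value falls below each significance level."""
    return {alpha: sum(1 for p in p_values if p < alpha) for alpha in levels}


# Hypothetical per-feature p-values (illustrative only).
p_values = [0.0005, 0.004, 0.03, 0.07, 0.2, 0.6]
counts = retained_by_significance(p_values)
```

When most p-values are near zero, as the text notes for RF-DNA fingerprints, these counts approach the full feature count at every level, which is why the threshold offers little guidance for selecting NDRA.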
Table IV-3: Dimensionality Assessment by p-value and Significance Level. Reprinted from [49].

SNR      Method                      Significance Level
                                     0.1%    1%      5%      10%
0 dB     F-test p-values             196     264     350     402
         KS-test summed p-values     37      74      130     160
10 dB    F-test p-values             589     639     674     688
         KS-test summed p-values     337     414     512     557
18 dB    F-test p-values             706     713     720     722
         KS-test summed p-values     666     692     711     716
30 dB    F-test p-values             718     725     727     728
         KS-test summed p-values     727     729     729     729

(c)  Data  Covariance  Matrix  Approaches 

DRA assessments on the intrinsic dimensionality in data can also be considered. If one considers the eigenvalues of the data covariance (or correlation) matrix, one can estimate data dimensionality. Given that RF-DNA features have consistent units, the covariance matrix was considered herein with three quantitative DRA assessment methods: Kaiser’s Criterion, Maximum Distance Secant Line (MDSL), and Horn’s Curve.

(i)  Kaiser  Criterion 

Kaiser’s criterion offers a basic estimate of NDRA, with eigenvalues greater than the average eigenvalue being retained [237, 458, 466]; when correlation eigenvalues are considered, this results in all eigenvalues greater than 1 being retained [467]. Although it can offer reasonable performance, it is also acknowledged as a rather arbitrary method [458]. Because this metric is frequently generalized to just selecting the eigenvalues above 1, the appropriate metric for covariance eigenvalues (above the mean) is presented along with the ‘above 1’ metric.

For the DRA assessment, the quantity of covariance matrix eigenvalues greater than the mean is retained [237, 458]. Kaiser’s criterion at SNR = 10 dB suggests retaining NDRA = 191 features.
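Both variants of the Kaiser criterion reduce to a simple count once the eigenvalues are available. The sketch below assumes the eigenvalues have already been computed elsewhere; the function name and the example eigenvalues are hypothetical.

```python
def kaiser_count(eigenvalues, use_mean=True):
    """Count eigenvalues above the Kaiser cutoff.

    use_mean=True  -> cutoff is the average eigenvalue (covariance case).
    use_mean=False -> cutoff is 1 (the common 'above 1' simplification).
    """
    cutoff = sum(eigenvalues) / len(eigenvalues) if use_mean else 1.0
    return sum(1 for ev in eigenvalues if ev > cutoff)


# Hypothetical covariance-matrix eigenvalues (illustrative only).
eigenvalues = [4.0, 2.5, 1.2, 0.2, 0.1]
n_above_mean = kaiser_count(eigenvalues)
n_above_one = kaiser_count(eigenvalues, use_mean=False)
```

For these values the mean is 1.6, so the two cutoffs disagree (two versus three retained), which illustrates why the 'above 1' shortcut is not interchangeable with the mean-based rule for covariance eigenvalues.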

(ii) Cattell’s Scree Plot

One extension of the Kaiser criterion involves including visual subjectivity in the form of Scree plots. Scree plots involve two dimensional plots of data covariance (or correlation) matrix eigenvalues versus rank order, and provide a visual method of determining the dimensionality of the data [237]. Cattell’s Scree Test involves visually examining the scree plot and selecting NDRA above the inflection point, the proverbial ‘elbow in the curve’ [458]. The difficulty of this method involves selecting the actual inflection point and NDRA.

1.  Maximum  Distance  Secant  Line  (MDSL) 

The MDSL approach, introduced by Johnson et al. [468], aims to remove subjectivity from Cattell’s Scree Test through algorithmic means. MDSL automates the test by 1) creating a line between the first and last rank ordered eigenvalues and 2) then finding the point with the largest perpendicular distance from this line, i.e., the inflection point [468]. Using MDSL at SNR = 10 dB, NDRA = 26 features would be retained.
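The two MDSL steps can be sketched as follows. This is a minimal sketch, not the algorithm of [468] as published: the function name and eigenvalues are hypothetical, the scree plot is assumed to use the raw rank index (1, 2, ...) on the horizontal axis and the raw eigenvalue on the vertical axis with no axis scaling, and whether the elbow point itself is counted in NDRA is left to the caller.

```python
import math


def mdsl_elbow(eigenvalues):
    """Return the 1-based rank of the point farthest from the secant line.

    The secant line joins the first and last rank ordered eigenvalues.
    """
    ev = sorted(eigenvalues, reverse=True)
    n = len(ev)
    if n < 3:
        raise ValueError("need at least three eigenvalues")
    x1, y1, xn, yn = 1, ev[0], n, ev[-1]
    dx, dy = xn - x1, yn - y1
    norm = math.hypot(dx, dy)
    best_rank, best_dist = 1, -1.0
    for rank, y in enumerate(ev, start=1):
        # Perpendicular distance from (rank, y) to the line through the end points.
        dist = abs(dy * rank - dx * y + xn * y1 - yn * x1) / norm
        if dist > best_dist:
            best_rank, best_dist = rank, dist
    return best_rank


# Hypothetical rank ordered eigenvalues (illustrative only).
eigenvalues = [10.0, 6.0, 3.0, 1.5, 1.0, 0.8, 0.6]
elbow = mdsl_elbow(eigenvalues)
```

Because perpendicular distance depends on the relative scaling of the two axes, a different axis normalization could move the selected elbow; the unscaled choice here is only one possibility.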

(iii) Horn’s Curve

Horn’s curve is another eigenvalue based DRA assessment method where eigenvalues are computed for a random dataset of the same size and rank as the ZigBee fingerprint set under analysis [469]. Horn’s curve involves plotting the data sample correlation matrix eigenvalues against the Horn’s curve eigenvalues [469]. The intrinsic dimensionality of the data is determined by counting the number of data eigenvalues that appear above Horn’s curve [469]. Using the Horn’s curve algorithm of Bigley [466], at SNR = 10 dB Horn’s curve indicated NDRA = 157 features should be retained.

4.2.6.3 DRA Assessments and ZigBee RF-DNA Features

As all of the presented DRA assessments provided different NDRA subsets, multiple DRA subsets must be considered. For comparison with qualitative methods, NDRA = [50, 100] subsets are examined for consistency with [113]. Additionally, a lower qualitative DRA assessment of NDRA = 10 is important to examine to understand performance when only a very limited subset of features is available, and thus to examine how DRA methods fundamentally interact with classifier performance. The resultant NDRA subsets to examine for competing DRA methods are thus

    N_{DRA} = [10, 26, 50, 100, 157, 191],    (4.19)

which considers both quantitative and qualitative methods. Comparison with the full-dimensional NDRA = 729 feature set is also requisite to generate a performance baseline for comparison.

4.3  DRA  Applications  to  ZigBee  Data 

To  understand  and  compare  the  presented  DRA  methods,  first  a  simple 
comparison  of  DRA  methods  results  through  correlation  will  be  considered.  Then  a 
comparison  of  how  different  DRA  methods  select  different  features  will  be  discussed. 
Finally,  a  comparison  of  classification  and  verification  performance  assessments,  with  the 
MDA/ML  classifier,  will  be  made  using  the  ZigBee  dataset. 

4.3.1  DRA  Method  Comparisons 

Consistency was seen in the KS-test, F-test, GRLVQI relevance values, and MDA loadings, where phase (φ) features are noticeably more relevant than both Amplitude (a) and Frequency (f) features. This observation is further consistent with [89, 113], which concluded that Phase (φ) features alone are typically the most relevant for reliable device discrimination.

However, it is not apparent that each method scores the same features similarly. Table IV-4 presents a correlation matrix using Pearson correlations at SNR = 10 dB, where most methods are seen to be not highly correlated in their scores. Incidentally, both GRLVQI relevance and random loadings were the least correlated to any other method, indicating limited similarity to the other methods. SSum and SMax were highly correlated, while the other loadings methods are less correlated, thus indicating that loadings methods are sensitive to the fusion method.

In  Table  IV-4,  both  the  KS-test  and  the  F-test  are  seen  to  be  highly  correlated, 
which  indicates  that  both  methods  achieve  similar  results.  This  is  largely  a  logical  result 
because  both  methods  are  univariate,  distribution  based,  and  consider  a  given  feature  and 
a  vector  of  categorical  class  identities.  The  F-test  result  was  also  highly  correlated  with 
USum  and  UMax,  mirroring  the  results  of  [462]  which  reported  a  positive  correlation  of 
0.675  between  DRA  loadings  and  the  F-test. 
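A correlation matrix of this kind can be sketched with a plain Pearson correlation between each pair of per-feature score vectors. The function names, method labels, and scores below are hypothetical and are not the values behind Table IV-4.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(score_table):
    """Map each (method_a, method_b) pair to the correlation of their scores."""
    names = list(score_table)
    return {(a, b): pearson(score_table[a], score_table[b])
            for a in names for b in names}


# Hypothetical per-feature scores from three methods (illustrative only).
scores = {
    "method_1": [0.1, 0.4, 0.9, 0.3],
    "method_2": [0.2, 0.8, 1.8, 0.6],   # exactly 2x method_1
    "method_3": [0.9, 0.6, 0.1, 0.7],   # exactly 1 - method_1
}
corr = correlation_matrix(scores)
```

Pearson correlation compares the raw score values; two methods can correlate well overall and still disagree on which few features rank highest, which is why the selected subsets are also compared directly.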
Table IV-4: Correlation Matrix for DRA Method Scores at SNR = 10 dB, from Bihl et al. [135]. High correlations (>0.8) and low correlations (<0.2) are in bold and shaded light grey.

                 Pre-classification   Post-classification                                              Baseline
Method           KS       F-Test      GRLVQI   Wilk’s   MLF SMax  MLF SSum  MLF UMax  MLF USum  Random
KS               1.0      --          --       --       --        --        --        --        -0.038
F-Test                    1.0         --       --       --        --        --        --        0.011
GRLVQI                                1.0      -0.082   -0.094    -0.030    -0.167    -0.178    0.041
Wilk’s                                         1.0      0.377     0.144     0.730     0.726     -0.037
MLF SMax                                                1.0       0.8589    0.630     0.565     -0.035
MLF SSum                                                          1.0       0.257     0.253     -0.046
MLF UMax                                                                    1.0       0.937     -0.004
MLF USum                                                                              1.0       -0.012
Random                                                                                          1.0

-- denotes an entry that is not legible in the source. Further legible entries whose columns cannot be determined: KS row, 0.665 and 0.6977; F-Test row, 0.890.
Since Table IV-4 illustrates that each DRA method is ranking features differently, examining the top ranked features across DRA methods is of interest. Consistent with [417], Figure IV-20 considers the top NDRA = 10 features through a bar plot showing which features are selected for each method. Only one replicate of the Random Selection DRA method is presented for brevity. Of interest in Figure IV-20 is that, although most features selected are Phase (φ) features (indices 244 to 486), most DRA methods selected entirely different features [417]. Interestingly, a few features in Figure IV-20 were consistently selected by multiple methods, thus indicating that some features are predominantly important, an observation consistent with results in [89, 113].

Figure IV-20: Top ranked NF = 10 reduced dimensional feature sets by DRA method, reprinted from [135].

Figure IV-21 and Figure IV-22 further consider the differences in DRA method feature ranking for NF = 26 and NF = 50, respectively. While the figures are consistent with Figure IV-20, where methods largely select different features, as NF increases it is apparent that DRA methods begin to select similar features, which are predominantly phase features.

Figure IV-21: Top ranked NF = 26 reduced dimensional feature sets by DRA method.

Figure IV-22: Top ranked NF = 50 reduced dimensional feature sets by DRA method.

Table IV-5 further examines the features selected by each DRA method per each DRA subset. In Table IV-5, the collective total features selected, NTot, for F-test, KS-test, GRLVQI, Wilk’s Lambda, USum, UMax, SSum, and SMax, are presented for each NDRA subset. When considering NDRA = 10, NTot = 61 total features were selected; however, 78.7% of these 61 features were uniquely selected by only one DRA method, and hence the remaining features were selected by multiple DRA algorithms. Table IV-5 presents additional information regarding the percentage of NTot which are amplitude (a), phase (φ), and frequency (f) features, and the percentage of variance (σ²), skewness (γ), and kurtosis (κ) statistics. Notable throughout DRA subsets, and consistent with [89, 113], is that the majority of features selected are phase features. No obvious biases are seen toward variance (σ²), skewness (γ), or kurtosis (κ) being selected.
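The NTot and percent-unique quantities can be sketched as a count over the union of the selected subsets. This is an illustration under the assumption that NTot is the number of distinct features selected by at least one method and that "% Unique" is the share of those selected by exactly one method; the function name and feature indices are hypothetical.

```python
from collections import Counter


def subset_overlap(selected_by_method):
    """Return (n_tot, pct_unique) for a mapping of method -> selected features."""
    counts = Counter(
        feature
        for features in selected_by_method.values()
        for feature in set(features)        # count each method at most once
    )
    n_tot = len(counts)                      # distinct features across methods
    n_single = sum(1 for c in counts.values() if c == 1)
    return n_tot, 100.0 * n_single / n_tot


# Hypothetical selections from three methods (illustrative only).
selected = {
    "method_1": [1, 2, 3],
    "method_2": [2, 3, 4],
    "method_3": [3, 5, 6],
}
n_tot, pct_unique = subset_overlap(selected)
```

Under this reading, the NDRA = 10 row of Table IV-5 corresponds to 48 of the 61 distinct features being chosen by a single method, since 48/61 is approximately 78.7%.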


Table IV-5: DRA Subset Statistics for F-test, KS-test, GRLVQI, Wilk’s Lambda, USum, UMax, SSum, and SMax. Reprinted from [135].

DRA Subset     NTot    % Unique    (a, φ, f) %          (σ², γ, κ) %
NDRA = 10      61      78.7%       7.5, 73.8, 18.7      32.5, 46.3, 21.2
NDRA = 26      142     72.5%       7.2, 65.9, 26.9      34.6, 38.0, 27.4
NDRA = 50      238     65.1%       7.0, 64.3, 28.7      37.8, 35.2, 27.0
NDRA = 100     381     48.8%       7.1, 57.1, 35.8      38.6, 32.3, 29.1
NDRA = 157     505     39.2%       7.5, 54.9, 37.6      38.3, 31.7, 30.0
NDRA = 191     545     31.9%       8.1, 53.5, 38.4      37.7, 32.5, 29.8

4.3.2  DRA  Method  Classification  Performance  Assessments 

Beyond comparing DRA methods statistically, further comparison of DRA methods through MDA/ML classification accuracy on the ZigBee RF-DNA dataset needs consideration. Representative MDA/ML average TST %C versus SNR results are presented in Figure IV-23 to Figure IV-25. Figure IV-23 presents results from the MDA/ML model using NDRA = 10, Figure IV-24 presents results from the MDA/ML model using NDRA = 26, and Figure IV-25 presents results from the MDA/ML model using NDRA = 50. Additional results from NDRA = [100, 157, 191] are presented later in tables.

Although at NDRA = 10 no feature selection method achieves the %C > 90% benchmark, and thus relative dB gain is not computed for comparison, the results in Figure IV-23 show DRA performance differences across methods. Consistent with [395], Figure IV-23 shows MLF-based methods as offering significantly higher performance than other DRA methods, with MLF methods having a 10% improvement in %C for most of the SNR considered when compared to other methods. Additionally, MLF methods have an SNR gain over competing DRA methods of 10 to 12 dB for 60% < %C < 75% (max). The results of NDRA = 10 suggest that MLF-based DRA methods perform better than competing methods here since MLF feature relevance rankings were computed close to the functions used for MDA classifier development.

Results for NDRA = 26 and NDRA = 50, respectively Figure IV-24 and Figure IV-25, show that all feature selection methods tend to achieve similar performance as the number of features considered increases [417]. Despite this, some differences are still seen in the performance offered by the DRA methods, with the loadings-based methods again offering significantly higher performance than the other methods under analysis.
Figure IV-23: ZigBee MDA/ML Testing (TST) classification performance for NDRA = 10 reduced dimensional feature sets, reprinted from [135].

Figure IV-24: ZigBee MDA/ML Testing (TST) classification performance for NDRA = 26 reduced dimensional feature sets, reprinted from [135].

Figure IV-25: ZigBee MDA/ML Testing (TST) classification performance for NDRA = 50 reduced dimensional feature sets, reprinted from [135].


Figure IV-23 through Figure IV-25 represent only a few instances showing the relationship between NDRA and classification performance. To further understand how DRA influences performance, Figure IV-26 considers classification performance and dimensionality of each DRA method at SNR = 10 dB. In Figure IV-26 additional NDRA subsets, NDRA = [250, 300, 350, 400, 450, 500, 550, 600, 650, 700], are considered along with those of (4.19). Figure IV-26 shows an expected decrease in classification accuracy as one decreases NDRA, which is especially seen for NDRA < 200. Consistently high performance is further seen across all methods except Wilk’s and Random.
DRA method at SNR = 10 dB. NDRA = [10, 26, 50, 100, 157, 191, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700] reduced dimensional feature sets are evaluated to understand how DRA fundamentally impacts performance. Reprinted from [135].


Table IV-6 reproduces a table in [135] by presenting gain tradeoff values at %C > 90% for all DRA methods and all NDRA levels of dimensionality from (4.19). However, Table IV-6 presents no values for NDRA = 10 since no DRA methods achieved %C > 90% at this level of dimensionality [135]. Gain tradeoff values in Table IV-6 show a considerable advantage of MLF methods at NDRA = 26 over other methods, where MLF methods achieve better performance than either GRLVQI or Wilk’s; additionally, SMax, SSum, and UMax achieve better performance than randomly selected sets. Incidentally, the MDA/ML model developed using either KS-test or F-test selected features does not achieve %C > 90% at NF = 26. As NDRA increases to NDRA = 157 and NDRA = 191, it is seen that the competing DRA methods offer comparable classification performance.
Table IV-6: Relative DRA "Gain" (dB) Over Baseline Performance for %C = 90% Classification Accuracy. Bold entries
with light grey shading denote best case (lowest gain) performance and bold entries denote values within 10% of the best.
Reprinted from [135].

                    Pre-Classification  Post-Classification                                         Baseline
DRA Subset          KS        F-Test    GRLVQI    Wilk's    MLF SMax  MLF SSum  MLF UMax  MLF USum  Random
N_DRA = 26    TNG   *         *         -18.747   -18.727   -14.269   -13.347   -13.809   -14.607   -14.937
              TST   *         *         -19.349   -19.967   -14.167   -13.817   -13.847   -14.967   -15.407
N_DRA = 50    TNG   -7.877    -8.337    -8.357    -9.617    -7.947    -7.697    -7.897    -9.957    -13.557
              TST   -8.077    -8.687    -8.787    -10.157   -8.347    -7.967    -8.387    -10.137   -13.007
N_DRA = 100   TNG   -4.707    -4.587    -3.387    -5.577    -4.137    -4.817    -4.127    -5.747    -8.997
              TST   -4.887    -4.817    -3.407    -5.987    -4.487    -4.957    -4.477    -6.067    -8.777
N_DRA = 157   TNG   -2.747    -2.627    -2.207    -4.287    -2.647    -2.487    -2.507    -2.727    -5.317
              TST   -2.927    -2.787    -2.357    -4.407    -2.937    -2.587    -2.727    -2.757    -4.957
N_DRA = 191   TNG   -2.007    -1.907    -1.767    -3.447    -2.007    -1.897    -2.017    -2.317    -5.967
              TST   -2.087    -2.077    -1.917    -3.437    -2.267    -1.927    -2.147    -2.407    -5.837

* Denotes cases where methods never achieve %C = 90%.

While the RAP results in Table IV-7 offer information comparable to that seen in
Table IV-6, RAP enables both an examination of N_F = 10 performance, which could
not be examined using gain, and a comparison across all SNR operating
points. In Table IV-7, higher values indicate higher performance and thus MLF DRA
methods are seen to offer the highest performance overall. From a classification
standpoint, the loadings methods, especially SSum, UMax, and USum, therefore appear to
offer higher and more consistent performance. Thus MLF methods offer a clear
classification performance improvement over methods previously presented, e.g., [89],
[113].

Table IV-7: Relative Accuracy Percentage (RAP) from Baseline N_DRA = 729 Feature Set. Bold entries with light grey
shading denote best case (highest scoring) performance. Reprinted from [135].

                    Pre-Classification  Post-Classification                                         Baseline
DRA Subset          KS        F-Test    GRLVQI    Wilk's    MLF SMax  MLF SSum  MLF UMax  MLF USum  Random
N_DRA = 10    TNG   65.12     70.82     62.99     71.28     71.12     68.50     71.17     72.71     61.48
              TST   65.52     71.59     63.79     72.29     71.83     68.91     71.84     73.33     61.87
N_DRA = 26    TNG   78.23     78.14     79.97     77.61     79.38     81.82     79.39     81.85     74.23
              TST   78.99     79.16     80.68     78.69     80.08     82.49     80.04     82.51     74.98
N_DRA = 50    TNG   87.52     87.25     87.45     85.08     87.59     88.11     87.34     87.42     78.69
              TST   88.05     87.88     88.01     85.95     88.30     88.71     88.17     88.05     79.25
N_DRA = 100   TNG   92.55     92.44     93.27     90.95     92.93     92.41     92.92     92.01     85.85
              TST   92.86     92.94     93.56     91.51     93.52     92.65     93.56     92.30     86.24
N_DRA = 157   TNG   94.97     95.54     95.52     92.95     95.47     95.97     95.67     95.59     90.77
              TST   95.39     95.99     95.89     93.37     95.89     96.36     96.16     96.00     91.48
N_DRA = 191   TNG   96.36     96.69     96.34     94.18     96.41     96.78     96.50     96.76     91.07
              TST   96.70     97.13     96.71     94.54     96.83     97.13     96.87     97.19     91.30
Average RAP         86.02     87.13     86.18     85.70     87.45     87.49     87.47     87.98     80.60
Cumulative RAP      1032.26   1045.57   1034.19   1028.40   1049.37   1049.85   1049.65   1055.71   967.20
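
The two summary rows of Table IV-7 appear to be the mean and the sum of the twelve
TNG/TST entries in each column. This is an inference from the tabulated numbers, not
a definition stated here, and the minimal sketch below only checks it for the KS
column. The names `ks_rap` and `summarize_rap` are illustrative.

```python
# KS column of Table IV-7: TNG and TST entries for N_DRA = 10, 26, 50, 100, 157, 191.
ks_rap = [65.12, 65.52, 78.23, 78.99, 87.52, 88.05,
          92.55, 92.86, 94.97, 95.39, 96.36, 96.70]


def summarize_rap(values):
    """Return (average, cumulative) RAP over all subset and partition entries."""
    cumulative = sum(values)
    return cumulative / len(values), cumulative


ks_average, ks_cumulative = summarize_rap(ks_rap)
print(f"average = {ks_average:.2f}, cumulative = {ks_cumulative:.2f}")
```

Under this reading the KS column reproduces the tabulated Average RAP and Cumulative
RAP to within rounding; other columns can be checked the same way.
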

4.3.3 DRA Method Verification Performance Assessments


"One vs one" device claimed ID verification performance was considered to
further evaluate each DRA classifier model. Figure IV-27a presents authorized device
claimed vs. actual ID verification assessment for UMax and N_F = 50 at SNR = 10 dB, the
SNR at which the baseline N_F = 729 MDA/ML classifier achieves %C = 90% accuracy.
The N_Auth = 4 authorized device ROC curves presented in Figure IV-27a show that 50%
of authorized devices are correctly authorized at TVR > 90% and FVR < 10% using this
model. Figure IV-27b similarly shows the rogue rejection rate for the UMax, N_F = 50
model at SNR = 10 dB. At the threshold of TVR > 90% and FVR < 10%, 33/36 or 91.7%
of rogue devices were correctly rejected.

Figure IV-27: ZigBee Device ID Verification performance for the N_DRA = 50 UMax
feature subset at SNR = 10 dB. Reprinted from [135].

(b) Rogue. Based on TVR > 90% and RAR < 10% criteria (solid lines), this
reflects RRR = 33/36 = 91.7% success. [Horizontal axis: False Verification Rate (FVR).]


To visually examine the results from the MDA classifiers developed from the
DRA algorithms and the DRA assessment methods, a total of 108 ROC curve figure pairs
would be needed. Results were therefore generated for all cases and are summarized in
Table IV-8. Here bold entries denote values within 10% of the best, and bold entries
with light grey shading denote best case performance. With the exception of Random
selection results, which logically offer the poorest performance for all N_DRA subsets, two
observations can be made: first, all other DRA selection methods offer comparable
verification performance for higher N_DRA subsets, e.g., N_DRA = [157, 191]; and second,
MLF-based methods offer generally consistent and generally superior performance for
lower dimensional subsets, e.g., N_F = [10, 26]. Consequently, the verification performance
results concur with the observations seen in the classification results.

Table IV-8: Device ID Verification Performance For %C = 90% at SNR = 10 dB: True Verification Rate (TVR) for
N_Auth = 4 Authorized Devices and Rogue Rejection Rate (RRR) for N_Auth x N_Rog = 36 rogue scenarios. Bold entries denote
values within 10% of the best, and bold entries with light grey shading denote best case performance. Reprinted from
[135].

                        Pre-Classification  Post-Classification                                         Baseline
DRA Subset              KS        F-Test    GRLVQI    Wilk's    MLF SMax  MLF SSum  MLF UMax  MLF USum  Random
N_DRA = 10    TVR (%)   0         25        0         25        25        50        25        50        0
              RRR (%)   36.11     52.78     19.44     41.67     38.89     36.11     38.89     50        31.48
N_DRA = 26    TVR (%)   50        50        50        50        50        50        50        50        25
              RRR (%)   69.44     72.22     80.56     63.89     75        75        77.78     75        51.85
N_DRA = 50    TVR (%)   50        75        50        75        50        50        50        50        50
              RRR (%)   86.11     91.67     91.67     83.33     91.67     91.67     91.67     88.89     75
N_DRA = 100   TVR (%)   75        75        100       75        75        75        75        75        66.67
              RRR (%)   94.44     94.44     94.44     94.44     94.44     94.44     94.44     94.44     86.11
N_DRA = 157   TVR (%)   100       100       100       100       100       100       100       100       75
              RRR (%)   94.44     94.44     94.44     94.44     94.44     94.44     94.44     94.44     91.67
N_DRA = 191   TVR (%)   100       100       100       100       100       100       100       100       75
              RRR (%)   97.22     97.22     94.44     94.44     97.22     97.22     97.22     97.22     91.67

V. Extensions to the LVQ-Family of Algorithms


The ant, viewed as a behaving system, is quite simple. The apparent complexity of its
behavior over time is largely a reflection of the complexity of the environment in which it
finds itself.

-Herbert A. Simon, 1916-2001

While various studies have extended Learning Vector Quantization (LVQ)
algorithms by considering non-Euclidean distance measures, the extensions are not
always correctly formulated and the reasons for considering alternate measures are not
always clear. Below, the Generalized Relevance Learning Vector Quantization Improved
(GRLVQI) process is fundamentally extended via a process to select and incorporate
alternative distance measures. As discussed in Chapter III, differences in LVQ algorithms
generally revolve around cost functions and hence changing distance measures involves
deriving new update equations.

5.1  Introduction 

Herein,  overall  LVQ  algorithm  considerations  include  the  following: 

1)  a  minor  general  improvement  to  LVQ  algorithms  is  made  by  using  a  scaled 
gradient  descent  which  enables  direct  comparison  of  learning  rates  between 
problems; 

2)  approaches  for  selecting  the  number  of  Prototype  Vectors  (PVs)  are 
considered; 

3) a derivative skeleton framework is created to generalize the process for
incorporating alternate distance measures into LVQ, Relevance Learning
Vector Quantization (RLVQ), Generalized Learning Vector Quantization
(GLVQ), Generalized Relevance Learning Vector Quantization (GRLVQ)
and GRLVQI algorithms;

4)  a  methodology  is  formalized  for  proper  selection  and  incorporation  of 
distance  measures  and  learning  rates; 

5)  a  new  cost  function  is  presented  for  GLVQ,  GRLVQ,  and  GRLVQI 
algorithms  to  permit  a  wide  variety  of  distance  measures  to  be  considered; 

6) a design of experiments (DOE) methodology with Analysis of Variance
(ANOVA)-based response surface methods and optimization of algorithm
parameter settings through sequential quadratic programming (SQP) are
employed to find optimal operating points.

The primary benefit of these improvements is that finding appropriate algorithm
parameter settings is optimized and a systematic process for deciding which distance
measure to use in LVQ algorithms is developed and considered.

The resultant improved GRLVQI algorithm is termed GRLVQI-Distance
(GRLVQI-D) to indicate the algorithm is generic and can be adapted to use any
differentiable distance measure. Additionally, similar extensions to the GLVQ and
GRLVQ algorithms are made, with these extended algorithms termed GLVQ-D and
GRLVQ-D, respectively.

This chapter is organized as follows. First, algorithmic development aspects
relative to LVQ through GRLVQI are presented in Section 5.2. The GRLVQI-D
algorithm is presented in Section 5.2.2.4 and a procedure is developed and applied in
Section 5.3 for selecting distance measures for GRLVQI-D. GRLVQI-D is extended to
RF-DNA Fingerprinting in Section 5.4.

5.2  GRLVQI-D  Algorithm  Development 

High levels of dimensionality are known to adversely affect Euclidean distance
based classifiers [470, 471], which is directly relevant to RF-DNA applications of LVQ
algorithms since RF-DNA fingerprints generally have a large number of features
and exemplars. Therefore, incorporating a non-Euclidean distance metric in GRLVQI
could be advantageous. However, to incorporate a non-Euclidean distance measure, the
underlying cost function must be changed in a given LVQ algorithm.

5.2.1  Prior  Implementations  of  non-Euclidean  Distances  in  LVQ 

In LVQ algorithms, a gradient descent is used with the step size a function of the
cost function. A gradient descent implicitly requires evaluating the gradient of the
associated cost function; therefore, a new PV update expression must be computed for
any change in the distance equation or cost function. GRLVQ and GRLVQI were
developed using squared Euclidean distance for selecting prototype vectors [245]. Other
LVQ variations have seen improvement through different distance metrics, e.g., the
innovations of Schneider et al. [298], where two new metrics similar in form to
Mahalanobis distances were incorporated into GRLVQ.

Common issues in LVQ distance measure extensions are neglecting to compute a
new PV gradient descent update equation when considering alternative distance equations,
and incorrect formulations, cf. [472-475]. These are common pitfalls found in the LVQ
literature. PV update equations can be generalized, per Ji et al. [476, 477], as

w(t + 1) = w(t) + cx, (5.1)

where c is a scalar and x is the PV update. However, such formulations imply that c is
merely a scalar step size when in fact it is composed of both the learning rate and a
gradient descent specified quantity. This is an important distinction since any given c is
specific to the cost function, learning rate, and the distance equation employed.

Biehl et al. [290] created distance measure variants for GRLVQ; however, the
process presented in Biehl et al. is not easily generalizable to other distances and the
equations are presented with non-intuitive formulations. Strickert et al. [291] formulated
a GRLVQ variant using a correlation based measure and provided justification for using
both distance metrics and measures; however, the formulation skipped over multiple
steps needed to make it generalizable to other problems. When a different distance
measure is used, the direction of the PV update must be considered relative to the
direction of the distance measure [291]. The solution adopted herein, as suggested by
Strickert et al. [291], is to merely flip the signs on the PV update equations.

However, all of these approaches created specific formulations and were not
readily generalizable. Since the process and equations presented for these applications are
not always intuitive or correctly followed, creating a general framework to facilitate
formulating PV update equations is beneficial. To create such a framework, the process
used to formulate PV update equations must be understood and the components identified
that need to be changed whenever a new distance equation is to be used. Therefore, to
avoid any confusion, the entire PV update equation is reported herein.

5.2.2  Developing  a  Differentiation  Skeleton  for  LVQ  Improvements 

The following general improvements are made to LVQ algorithms. First, Section
5.2.2.1 presents a scaled gradient descent method for any LVQ algorithm to enable direct
comparison of learning rates. Then Section 5.2.2.2 discusses gradient descent
considerations when making changes to LVQ algorithms; supporting derivations are
provided in Appendices E and F. Cost function extensions to GLVQ, GRLVQ, and
GRLVQI are discussed in Section 5.2.2.3 and Appendix G. Finally, relevance derivatives for
GRLVQ and GRLVQI algorithms are discussed in Section 5.2.2.3 and Appendix H.
A differentiation skeleton for incorporating any differentiable distance measure in LVQ,
RLVQ, GLVQ, GRLVQ, and GRLVQI is then presented in Section 5.2.2.4.


5.2.2.1 Scaled Gradient Descent

Widrow-Hoff (W-H) learning is a least mean squares formulation for the gradient
descent [243; 250, pp. 55-57; 478]. W-H considers a squared Euclidean distance metric
(e) for general gradient descent updating of LVQ [250, pp. 55-57; 478]. The gradient of
function f is given by

\[ \nabla f_K = \left( \frac{\partial f_K}{\partial X_1}, \frac{\partial f_K}{\partial X_2}, \ldots, \frac{\partial f_K}{\partial X_p} \right), \tag{5.2} \]

where K is the step number and p is the number of variables [243]. From (5.2), a gradient
search for a maximum can be computed via

\[ X_{i+1} = X_i + \frac{\delta \nabla f_i}{\lVert \nabla f_i \rVert}, \tag{5.3} \]

where δ is the learning rate or step size [243]. Given δ/||∇f_i|| is a scalar, the scaled
learning rate can be incorporated in other gradient descents. Considering the gradient
descent algorithm in (3.20), it can be rewritten as

\[ w_e(t + 1) = w_e(t) + \epsilon^*(t) \nabla e, \tag{5.4} \]

where

\[ \epsilon^*(t) = \frac{\epsilon(t)}{\lVert \nabla e \rVert}. \tag{5.5} \]


The  underlying  advantage  of  incorporating  (5.4)  and  (5.5)  in  LVQ,  RLVQ,  GLVQ, 
GRLVQ  and  GRLVQI  is  that  it  enables  a  direct  comparison  of  learning  rates  across  LVQ 
methods  and  datasets  without  significantly  changing  the  algorithms. 
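
The scaling in (5.4) and (5.5) can be sketched as follows. This is a minimal
illustration, not the implementation used in this work: it assumes a squared
Euclidean cost and a descent sign convention (the in-class PV moves toward the
exemplar), and the names `scaled_gradient_step` and `sq_euclid_grad` are hypothetical.

```python
import math


def scaled_gradient_step(w, grad, rate):
    """Move w by -rate * grad / ||grad||, so the step length equals rate."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(w)  # zero gradient: nothing to scale, leave w unchanged
    return [wi - rate * gi / norm for wi, gi in zip(w, grad)]


def sq_euclid_grad(x, w):
    """Gradient of ||x - w||^2 with respect to w."""
    return [-2.0 * (xi - wi) for xi, wi in zip(x, w)]


x = [3.0, 4.0]   # exemplar
w0 = [0.0, 0.0]  # prototype vector before the update
w1 = scaled_gradient_step(w0, sq_euclid_grad(x, w0), rate=0.5)
step_length = math.dist(w0, w1)
```

Because the gradient is normalized, the distance moved per update is set by the
learning rate alone, independent of the data scale, which is what permits comparing
learning rates across datasets.
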


5.2.2.2 Gradient Descent and Derivatives in LVQ Algorithms

To incorporate a non-Euclidean distance measure in LVQ, we must consider the
gradient computation, as seen in (3.20) and discussed in Section 3.3.1, of the cost
function C(w_n(t)). For LVQ, the cost function is the distance measure itself. Therefore,
creating a non-Euclidean distance LVQ algorithm requires 1) selecting a distance
measure to replace (3.21), and 2) updating the cost function by computing the first
derivative of the new measure to replace x_j − w_e(t) in (3.24) and (3.25). The
appropriate in-class PV signs would then be computed per the derivative and then
considered with respect to what the new measure represents.

(a) Gradient Descent in RLVQ Relevance Computation


Per the discussion in both Section 3.3.1.4 and [266], the RLVQ expression in
(3.31) is also computed via a gradient descent. Thus, when changing a distance metric in
RLVQ it is necessary to change the cost function. When considering the RLVQ gradient
descent in (3.29), the cost function for RLVQ is the distance in (3.30). The product rule
for derivatives is

\[ d(uv) = u\,dv + v\,du, \tag{5.6} \]

where u and v are two different variables [279]. For the RLVQ cost function, one logical
choice would be u = λ_p and v = (x_i − w_n)², which is considered for ∂d/∂λ_p, the
derivative of the distance d with respect to λ_p. This results in the following
derivation:

\[ \frac{\partial d}{\partial \lambda_p} = \lambda_p \cdot 0 + 1 \cdot (x(t) - w(t))^2 = (x(t) - w(t))^2, \tag{5.7} \]

with the final expression being the expression in (3.31), with the sign being associated
with the convention where smaller values indicate higher significance and larger values
indicate lower significance [266].
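
The result in (5.7) can be checked numerically. The sketch below assumes that the
distance in (3.30) is a relevance-weighted squared Euclidean distance, i.e., the sum
over features of λ_p (x_p − w_p)²; that form is an assumption made here for
illustration, as are all function names.

```python
def weighted_sq_distance(x, w, lam):
    """Assumed RLVQ distance: sum over features of lam_p * (x_p - w_p)^2."""
    return sum(l * (xi - wi) ** 2 for l, xi, wi in zip(lam, x, w))


def relevance_gradient(x, w):
    """Analytic derivative with respect to each relevance, per (5.7)."""
    return [(xi - wi) ** 2 for xi, wi in zip(x, w)]


def numeric_relevance_gradient(x, w, lam, h=1e-6):
    """Central finite-difference estimate of the same derivative."""
    grads = []
    for p in range(len(lam)):
        up = list(lam)
        dn = list(lam)
        up[p] += h
        dn[p] -= h
        diff = weighted_sq_distance(x, w, up) - weighted_sq_distance(x, w, dn)
        grads.append(diff / (2.0 * h))
    return grads


x = [1.0, 2.0, -0.5]
w = [0.5, 2.5, 1.0]
lam = [0.2, 0.3, 0.5]
analytic = relevance_gradient(x, w)
numeric = numeric_relevance_gradient(x, w, lam)
```

The derivative does not depend on the current relevance values, which is why the
`relevance_gradient` helper takes only the exemplar and the prototype.
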

(b) Gradient Descent in GLVQ, GRLVQ, and GRLVQI

Although the gradient descent derivations for LVQ and RLVQ appear trivial, as
discussed in Section 5.2.2.2(a), the derivations are non-trivial when the gradient descents
are computed for GLVQ, GRLVQ and GRLVQI. To fully understand the process,
derivations for the PV update gradient descent operations and relevance gradient descent
are discussed in Appendices E through H.


5.2.2.3 Cost Function Extensions for the GLVQ Family of Algorithms, the GLVQ-D,
GRLVQ-D, and GRLVQI-D Algorithms

The nominal relative distance difference equation for GLVQ, GRLVQ, and
GRLVQI presents issues when non-Squared Euclidean distance measures are used. For
this equation to yield the expected values between -1 and +1, it assumes that the distance
measure yields a positive value. When changing the distance measure to a non-squared
Euclidean distance, one is not ensured of the distance being positive. Hence selecting an
appropriate relative distance difference equation is necessary. Two obvious approaches
were considered: an absolute value measure, where the absolute value of each distance is
taken, and a squared measure, where each distance is squared. The absolute value
approach, which would consider

\[ \mu(x^m) = \frac{|d^J| - |d^L|}{|d^J| + |d^L|}, \tag{5.8} \]

has notable issues and was not developed further because this would require an overly
complex gradient descent method due to there being three conditions of absolute value
derivatives: positive, negative, and 0, where the function itself is continuous but not
differentiable at 0 [479]. Therefore, only an improved squared relative distance
difference function will be developed and considered.

In order for the new relative distance difference equation to compute the same
scores for the nominal squared-Euclidean distance measure, the following improved
equation was developed,

\[ \mu(x^m) = \frac{(d^J)^2 - (d^L)^2}{(d^J)^2 + (d^L)^2}, \tag{5.9} \]

where each distance is ensured to be positive. However, by changing μ(x^m) a new
GLVQ gradient descent must necessarily be computed, per Section 5.2.2.2(b). The
derivation for the new GLVQ gradient descent is presented in Appendix G, with the
resultant PV update becoming


u,  ,  ,  8e(t)(d/M*(xm))dL 

wj(t+  1)  =W7(0+  - (dJ  -  dLy -  (Xm-Wjy 

wK(t  +  1)  =  wL(t) - ^  -  ^L)3- 


(5.10) 


{dl  +  dL)2 

which differs from the PV updates in (3.35) only by the scalar multiplier and the squared
terms in the relative distance difference equations.
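
Two properties of the squared form in (5.9) can be illustrated with a short sketch.
It assumes the nominal relative distance difference has the form
(d^J − d^L)/(d^J + d^L), and it uses the inner product as an example of a measure
that can return negative values; the function names and data are illustrative only.

```python
import math


def nominal_mu(d_in, d_out):
    """Assumed nominal form: (d^J - d^L) / (d^J + d^L)."""
    return (d_in - d_out) / (d_in + d_out)


def squared_mu(d_in, d_out):
    """Squared form of (5.9): ((d^J)^2 - (d^L)^2) / ((d^J)^2 + (d^L)^2)."""
    return (d_in ** 2 - d_out ** 2) / (d_in ** 2 + d_out ** 2)


def sq_euclid(x, w):
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def euclid(x, w):
    return math.sqrt(sq_euclid(x, w))


def inner_product(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))


x = [1.0, 2.0]
w_in = [1.5, 1.0]    # nearest in-class PV
w_out = [-2.0, 0.5]  # nearest out-of-class PV

# Nominal form with squared Euclidean vs. (5.9) with (non-squared) Euclidean.
mu_nominal = nominal_mu(sq_euclid(x, w_in), sq_euclid(x, w_out))
mu_squared = squared_mu(euclid(x, w_in), euclid(x, w_out))

# A measure that can be negative: the nominal form leaves [-1, +1], (5.9) does not.
mu_nominal_ip = nominal_mu(inner_product(x, w_in), inner_product(x, w_out))
mu_squared_ip = squared_mu(inner_product(x, w_in), inner_product(x, w_out))
```

In this example the first two scores coincide, while the inner product pushes the
nominal form outside the expected range and (5.9) stays bounded.
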

When  considering  GRLVQ  or  GRLVQI,  one  must  also  update  the  relevance 
gradient  descent  if  the  relative  distance  difference  equation  has  been  changed.  Appendix 
H  presents  this  process  for  (5.9)  and  yields  a  new  relevance  update, 


%(?+  1)  =  xpq(t) 
-  e(0/'U(xm) 


2{d]  -  dL)(xjq(t)  -  wnq(t)Y 
0 dJ  +  dL)2 


(5.11) 


which is equivalent to the GRLVQ relevance update in (3.37) prior to being multiplied
and written out. Following the considerations of Section 3.3.1.6 and Appendices E
through H, the underlying GRLVQ PV gradient descent is thus,
,  ,  8  e(t)(df /dfi(xm))dL  . 

wJ(t  +  1)  =  wJ(t )  + -  —  * 2 - V  ( xm  -  wJ)3 

8  e(t)(df /dpL(xmy)d]  , 

wKa + 1) = wL(t )  — (74^)2  v  (xm  ~ w  )  ■ 


(5.12) 


5.2.2.4 A Differentiation Skeleton for LVQ Distance Metrics

Examining the derivation process that yields the PV updates for LVQ, RLVQ,
GLVQ, or GRLVQ, one can notice a few patterns. First, while the gradient descent
cost functions in LVQ and GLVQ differ dramatically, one will compute the same first
derivative for a given distance metric for both algorithms since the distance metric is the
cost function in LVQ, per (3.20)-(3.25). In GLVQ, the distance metric first derivatives
are the same as in LVQ except for denotation of the appropriate in-class and out-of-class
distances, (E.17)-(E.20); however, additional derivatives must be computed for the cost
function, (E.1)-(E.5), and the relative distance difference equation, (E.7)-(E.16). These
must then be assembled; however, these are noticeably identical when changing distance
measures except for (possibly) sign and the appropriate in/out of class subscript.
Additionally, as long as the same logistic sigmoid cost function is employed per
(E.1)-(E.5), one does not need to recompute its derivative, f'(μ(x^m)). Similarly, the
derivatives in RLVQ and GRLVQ are closely related to the derivative computed for their
respective cost function.

As long as the underlying gradient descent process in (3.20) is not changed, the
derivative approach will be consistent. It follows that as long as the difference
equation in (3.34) is used, the general quotient rule process in (E.6) will be
consistent, and therefore changing the distance metric in a GLVQ type of gradient
descent process merely involves computing the derivatives ∂u/∂w^J, ∂v/∂w^J, ∂u/∂w^L,
and ∂v/∂w^L and then computing the resultant equation via the quotient rule.
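
The quotient rule step can be verified numerically for one concrete case. The sketch
below assumes u = d^J − d^L and v = d^J + d^L with a squared Euclidean distance, for
which the quotient rule gives ∂μ/∂w^J = (∂d^J/∂w^J) · 2d^L/(d^J + d^L)². It is an
illustrative check under those assumptions, not a reproduction of Appendix E, and the
function names are hypothetical.

```python
def sq_euclid(x, w):
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def mu(x, w_in, w_out):
    """Relative distance difference u / v with u = d^J - d^L, v = d^J + d^L."""
    d_in, d_out = sq_euclid(x, w_in), sq_euclid(x, w_out)
    return (d_in - d_out) / (d_in + d_out)


def dmu_dw_in(x, w_in, w_out):
    """Quotient-rule result: (dd^J/dw^J) * 2 d^L / (d^J + d^L)^2."""
    d_in, d_out = sq_euclid(x, w_in), sq_euclid(x, w_out)
    scale = 2.0 * d_out / (d_in + d_out) ** 2
    # For squared Euclidean distance, dd^J/dw^J = -2 (x - w^J).
    return [scale * (-2.0 * (xi - wi)) for xi, wi in zip(x, w_in)]


def numeric_dmu_dw_in(x, w_in, w_out, h=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grads = []
    for k in range(len(w_in)):
        up = list(w_in)
        dn = list(w_in)
        up[k] += h
        dn[k] -= h
        grads.append((mu(x, up, w_out) - mu(x, dn, w_out)) / (2.0 * h))
    return grads


x = [1.0, 2.0]
w_in = [1.5, 1.0]
w_out = [-2.0, 0.5]
analytic = dmu_dw_in(x, w_in, w_out)
numeric = numeric_dmu_dw_in(x, w_in, w_out)
```

Swapping in another differentiable distance only requires replacing `sq_euclid` and
the bracketed distance derivative; the scalar factor from the quotient rule is
unchanged.
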

Following the above knowledge, Figure V-1 presents a decomposition of the GLVQ,
GRLVQ and GRLVQI gradient descents and from where each respective part is
computed. Using this knowledge, one can determine which component of the gradient
descent needs to be updated based upon which change is made in the algorithm. For
example, if only the distance measure is changed, then only the component in red needs
to be changed; care must be taken with the scalar multiplier, since this is a function of
both the distance measure and relative distance difference, and it could further also be a
function of the cost function, depending on what is changed.

Observable in Figure V-1 is that this visualization is generalizable to LVQ as well
as GLVQ algorithms. For instance, in LVQ and RLVQ, the cost function of the gradient
descent is the distance measure itself and thus the distance measure and relative distance
difference measure related components of Figure V-1 are not considered and one only
computes the derivative of the cost function. One can further similarly observe relevance
updates as seen in Figure V-2. Extending from these observations, an algorithmic
skeleton for making various changes to LVQ, RLVQ, GLVQ, GRLVQ, and GRLVQI is
presented in Figure V-3.

Figure V-1: Components of GLVQ, GRLVQ and GRLVQI Gradient Descents.

[Figure legend: distance measure related; cost function related; relative distance
difference metric related.]

Figure V-2: Components of GLVQ, GRLVQ and GRLVQI Gradient Descents.

Algorithm 3  LVQ Derivative Framework

Select new distance metric d(x, w)
if ∂d(x, w)/∂w exists do
    Compute ∇C(w(t)) = ∂d(x, w)/∂w
    Insert ∇C(w(t)) into LVQ algorithm per w(t + 1) = w(t) − ε(t)∇C(w(t))
    Use new d(x, w) in argmin_j (Σ d(x_i, w_j))
end
if RLVQ expression desired
    Extend d(x, w) function to include relevance
    Compute ∂d(x, w)/∂λ_p
    Extend LVQ function to include logic for relevance computation
end
if GLVQ expression desired
    Select cost function, f(μ(x^m)), and distance measure μ(x^m)
    Compute derivative for cost function f(μ(x^m)) via
        ∂f(μ(x^m))/∂w = [∂f(μ(x^m))/∂μ(x^m)] [∂μ(x^m)/∂w]
    Compute derivative for sigmoid: ∂f(μ(x^m))/∂μ(x^m)
    Consider sigmoid distance metric and compute for ∂μ(x^m)/∂w^J & ∂μ(x^m)/∂w^L
    if μ(x^m) = (d^J − d^L)/(d^J + d^L)
        Compute:
            ∂μ(x^m)/∂w^J = (∂d^J/∂w^J)(2d^L)/(d^J + d^L)^2  and
            ∂μ(x^m)/∂w^L = (∂d^L/∂w^L)(−2d^J)/(d^J + d^L)^2
    else
        Compute new derivative expression for distance measure
    end
    Assemble equations
end
if GRLVQ or GRLVQI expression desired
    Follow procedure for GLVQ
    Compute ∂d(x, w)/∂λ_p
    Assemble equations
end
end


Figure V-3: Pseudocode Process and Derivative Skeleton for Changing Distance
Metrics in LVQ, RLVQ, GLVQ, and GRLVQ.
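
The first branch of the skeleton (plain LVQ with a swapped distance) can be sketched
as below. This is a simplified LVQ1-style update written for illustration: the
distance and its derivative with respect to the PV are passed in as functions, so
changing the measure changes only those two arguments. The city block sign function
is used as a subgradient since that measure is not differentiable at zero; all names
and data are hypothetical.

```python
def sq_euclid(x, w):
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def sq_euclid_grad(x, w):
    """Derivative of the squared Euclidean distance with respect to w."""
    return [-2.0 * (xi - wi) for xi, wi in zip(x, w)]


def city_block(x, w):
    return sum(abs(xi - wi) for xi, wi in zip(x, w))


def city_block_grad(x, w):
    """Subgradient of the city block distance with respect to w."""
    return [-float((xi > wi) - (xi < wi)) for xi, wi in zip(x, w)]


def lvq_step(x, label, prototypes, proto_labels, dist, dist_grad, rate):
    """One LVQ1-style update with a caller-supplied distance and gradient."""
    # Winner: the prototype with the smallest distance to the exemplar.
    j = min(range(len(prototypes)), key=lambda k: dist(x, prototypes[k]))
    grad = dist_grad(x, prototypes[j])
    # Descend the distance for a correct winner, ascend it for an incorrect one.
    sign = -1.0 if proto_labels[j] == label else 1.0
    prototypes[j] = [wi + sign * rate * gi for wi, gi in zip(prototypes[j], grad)]
    return j


pvs = [[0.0, 0.0], [4.0, 4.0]]
pv_labels = [0, 1]
x = [1.0, 1.0]
before = sq_euclid(x, pvs[0])
winner = lvq_step(x, 0, pvs, pv_labels, sq_euclid, sq_euclid_grad, rate=0.1)
after = sq_euclid(x, pvs[0])
```

Calling `lvq_step` with `city_block` and `city_block_grad` instead exercises the
same update logic with a different measure, which is the substitution the first
block of Algorithm 3 describes.
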

5.3 Selecting Distance Measures for GRLVQI-D

With the GRLVQI-D algorithm formalized, one must now determine which
distance measure should be incorporated. However, even with the process presented in
Section 5.2.2 formalized, incorporating a measure is still non-trivial considering the
various derivatives and computations. It is additionally non-intuitive which distance
measure to select. Appendix I reviews various distance measures as described by Cha
[283] in his review of distance measures.

A general distance measure selection process for LVQ algorithms is therefore
presented due to 1) the long list of possible distance measures, 2) the involved derivation
process required to implement a new distance measure in GRLVQ or GRLVQI, 3) the
large amount of data and computation time needed for RF-DNA applications, and 4) no
extant guidance on which distance measures should be considered. The proposed
distance measure selection process innovates via the following: 1) distance measures are
first compared via correlation on two random vectors, 2) uncorrelated distance measures
are then selected via statistical clustering, 3) the gradients (first derivatives) of
these measures are computed and LVQ performance is examined on an academic problem
dataset, and finally, 4) measures that offer good performance in LVQ are then examined
in RLVQ, GLVQ, and GRLVQ. Underperforming distance measures are not considered
in subsequent algorithms, e.g., a measure that performs poorly in LVQ is not considered
in RLVQ, due to the general belief that if one cannot solve a simple problem then one
will have difficulties solving more complex problems. Figure V-4 presents the general
methodology for selecting distance measures and developing distance measure variants of
GRLVQI.



Figure  V-4:  Iterative  Process  for  Selecting  Distance  Metrics  for  GRLVQI. 

5.3.1 Selecting Distance Measures for Consideration

Cha [283] identified 62 different distance measures and metrics, which can be
grouped into 9 related groups as described in Appendix I: Minkowski, L1, Intersection,
Inner Product, Fidelity, Squared L2, Shannon's entropy, Combinations, and Vicissitude.
However, many of these distance metrics are highly related, correlated, or contain non-
differentiable factors. Therefore, only a few were evaluated for GRLVQI, and measures
employing maximization or minimization were not considered due to the dubious
derivations [480]. Considering the excluding factors, 22 measures remained for
consideration: Euclidean, City Block, Squared Euclidean, Sorensen, Canberra, Inner
Product, Harmonic Mean, Cosine, Pseudo-Cosine, Kumar-Hassebrook, Jaccard, Dice,
Pearson χ², Neyman χ², Squared χ², Divergence, Additive Symmetric, Kumar-Johnson,
Covariance, Correlation, Mahalanobis, and Squared Mahalanobis.

5.3.2  Comparing  Potential  Distance  Metrics  via  Correlation 

To understand how the remaining 22 distance measures are related, a correlation
study was posed in which distance measures are grouped based upon the correlation of
their results, and only dissimilar distance measures are selected for further analysis and
incorporation into LVQ. To quantify the correlation between distance measures, two
uncorrelated random normal vectors of length 1,000 were permuted. These vectors had a
Pearson correlation coefficient of 0.024. The vectors were then inserted for P and Q in
the appropriate equations in Appendix I, and 1,000 paired distances between P and Q
were computed.

Figure V-5 presents the correlation matrix of the paired distance measure results.
A few observations can be made from Figure V-5: firstly, many distance metrics
are highly correlated only within Cha’s [283] ‘families’ or groups; secondly, no
measure appears highly correlated with all other measures; and thirdly, both positive and
negative correlations are seen. Positive and negative correlations should logically be
considered with respect to the nominal squared Euclidean measure: measures that are
negatively correlated with the squared Euclidean measure have larger values for
more similar exemplars and smaller values for more different exemplars. Consistent with
[481], when employing measures negatively correlated with squared Euclidean distance,
one desires to maximize the distance rather than minimize it.

Figure V-5: Correlation Comparison of Distance Metrics on Random Normal Data.
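
The correlation study can be illustrated with a minimal sketch. The Python example below
is an illustration under stated assumptions rather than the implementation used in this
work: it assumes each paired distance is obtained by re-shuffling Q against P, it covers
only five of the 22 measures, and it runs fewer trials than the 1,000 used above to keep
the runtime short. The function names (correlation_study, pearson) are hypothetical.

```python
import math
import random


def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def euclidean(p, q):
    return math.sqrt(squared_euclidean(p, q))


def city_block(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def inner_product(p, q):
    return sum(a * b for a, b in zip(p, q))


def cosine(p, q):
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return inner_product(p, q) / (norm_p * norm_q)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


MEASURES = {
    "euclidean": euclidean,
    "city_block": city_block,
    "squared_euclidean": squared_euclidean,
    "inner_product": inner_product,
    "cosine": cosine,
}


def correlation_study(n=1000, trials=200, seed=0):
    """Correlate the outputs of each pair of measures over shuffled vector pairs."""
    rng = random.Random(seed)
    p = [rng.gauss(0.0, 1.0) for _ in range(n)]
    q = [rng.gauss(0.0, 1.0) for _ in range(n)]
    results = {name: [] for name in MEASURES}
    for _ in range(trials):
        rng.shuffle(q)  # assumption: each trial re-pairs the elements of P and Q
        for name, measure in MEASURES.items():
            results[name].append(measure(p, q))
    names = list(MEASURES)
    return {(a, b): pearson(results[a], results[b]) for a in names for b in names}


corr = correlation_study()
for name in MEASURES:
    print(f"{name:18s} vs squared_euclidean: {corr[(name, 'squared_euclidean')]:+.3f}")
```

Under these assumptions the inner product and cosine measures are perfectly negatively
correlated with squared Euclidean distance, because shuffling leaves the vector norms
unchanged. This is the case noted above in which the measure would be maximized rather
than minimized.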


To select distance measures for inclusion in GRLVQI, hierarchical clustering,
consistent with [482], was used to find groups of distance measures. Hierarchical
clustering considers a distance matrix between variables and then applies a linkage
approach to determine how variables are connected [448]. For the distance matrix, the
correlation matrix from Figure V-5 was used since this is the relative distance of interest.

A dendrogram, a diagram employed in cluster analysis to show partitions and
closeness of variables [236], is presented in Figure V-6. Figure V-6 is
interpreted as follows: the y-axis indicates closeness of variables and ranges from 0
(similar) to a maximum of 4 (distant) [449]. At the maximum value, all variables are
linked together; heading towards zero (where only similar variables are linked), groups
are determined through an appropriate linkage method [449, 450]. The complete linkage
method, which measures the closeness of two groups by their most distant pair of members
and joins the closest groups first [236, 451], was used to evaluate closeness.


Figure V-6: Dendrogram with Complete Linkage and Correlation Matrix,
from Figure V-5, as Distance Matrix.


The number of clusters, and hence the number of distance measures to consider, was
determined by setting a subjective closeness threshold based on how far apart the
groupings in Figure V-6 appear. A threshold of 0.5 was used, resulting in 9 clusters to
consider. A “Chinese Menu” approach, consistent with [486-491], was then used to
select distance measures, wherein one measure from each group was selected. To facilitate
derivations and inclusion into LVQ algorithms, the simplest distance equation in each
group was selected. This resulted in the following nine distance measures being further
considered: Additive Symmetric, Neyman χ², Pearson χ², Sorensen, Pseudo-Cosine,
Canberra, Squared Euclidean, Cosine, and Squared Mahalanobis.
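
The clustering and selection steps above can be sketched as follows. This is a minimal
illustration, not the procedure used to produce Figure V-6: the four measures "A" to "D"
and their correlation values are hypothetical, the dissimilarity is assumed to be
1 - correlation, and the representative of each cluster is simply its first member
rather than the simplest equation.

```python
def complete_linkage_clusters(dist, threshold):
    """Agglomerative clustering; stop once the closest clusters exceed threshold."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Complete linkage: cluster distance is the largest member-to-member distance.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        pairs = [
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        d, i, j = min(pairs)
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical correlation matrix for four placeholder measures.
names = ["A", "B", "C", "D"]
corr = [
    [1.00, 0.95, 0.10, 0.05],
    [0.95, 1.00, 0.15, 0.10],
    [0.10, 0.15, 1.00, 0.90],
    [0.05, 0.10, 0.90, 1.00],
]
# Assumed conversion from correlation to dissimilarity.
dist = [[1.0 - r for r in row] for row in corr]

clusters = complete_linkage_clusters(dist, threshold=0.5)
# "Chinese Menu" selection: take one measure from each cluster.
representatives = [names[c[0]] for c in clusters]
print(clusters, representatives)
```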

5.3.3  Determining  Suitable  Distance  Measures  and  LVQ  Algorithm  Settings 

To understand how LVQ distance measure extensions behave at various
operating points, a small academic dataset was considered, and learning and relevance
rates were varied for each LVQ distance measure variant. As underperforming
algorithms were found, they were not considered further; e.g., poorly performing LVQ
distance measure variants were not further considered in RLVQ. Fisher Iris [235], a
small academic dataset, was considered with N_F = 4 and N_obs = 150, with data equally
divided among N_C = 3 classes. Training and testing sets were segregated by taking the
first 45 observations from each class for training, with the remaining 5 observations per
class used for testing. To remove randomization issues, 100 iterations were considered
with the classification accuracy averaged.

Because the dynamic range and values computed by the different distance metrics
will differ, before considering RF-DNA data in GRLVQI, the relationship between
learning rate and number of PVs was first explored in LVQ with the Fisher Iris academic
dataset. This provides an understanding of how each measure behaves, both on its own
and compared to the nominal squared Euclidean distance metric. This
approach is considered iteratively, as described in Figure V-3, with each measure first
examined in LVQ, then RLVQ, GLVQ, and finally GRLVQ. As measures are found to
offer little or no performance benefit, they are removed from consideration in further
iterations (e.g., if a given measure performs poorly in LVQ, it is not examined in RLVQ,
GLVQ, or GRLVQ) since, logically, if a measure offers poor performance and relatively
little understanding of its behavior in a simple algorithm, it will be difficult for it to
offer good performance in a complex algorithm.
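
To make the idea of an LVQ distance measure variant concrete, the sketch below shows a
single LVQ1-style update in which the distance function and its gradient with respect to
the PV are interchangeable. It is a hedged illustration and not the Appendix J
formulation: the function names are hypothetical, and the Canberra measure is assumed to
use absolute values in its denominator.

```python
def sq_euclid(x, w):
    return sum((a - b) ** 2 for a, b in zip(x, w))


def sq_euclid_grad(x, w):
    # Gradient of the squared Euclidean distance with respect to the PV w.
    return [-2.0 * (a - b) for a, b in zip(x, w)]


def sign(v):
    return (v > 0) - (v < 0)


def canberra(x, w):
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, w) if abs(a) + abs(b) > 0)


def canberra_grad(x, w):
    # Quotient rule applied to |x - w| / (|x| + |w|), feature by feature.
    grad = []
    for a, b in zip(x, w):
        s = abs(a) + abs(b)
        if s == 0:
            grad.append(0.0)
        else:
            grad.append((-sign(a - b) * s - abs(a - b) * sign(b)) / s ** 2)
    return grad


def lvq1_step(x, label, prototypes, proto_labels, dist, grad, rate):
    """Move the winning PV toward x if the labels match, away otherwise."""
    k = min(range(len(prototypes)), key=lambda i: dist(x, prototypes[i]))
    direction = -1.0 if proto_labels[k] == label else 1.0
    g = grad(x, prototypes[k])
    prototypes[k] = [w + direction * rate * gi for w, gi in zip(prototypes[k], g)]
    return k


# Toy example with two PVs; only the distance/gradient pair changes per variant.
demo_pvs = [[0.5, 1.5], [3.0, 3.0]]
winner = lvq1_step([1.0, 2.0], 0, demo_pvs, [0, 1], sq_euclid, sq_euclid_grad, 0.1)
print(winner, demo_pvs[winner])
```

Because the step size is scaled by the gradient of the chosen measure, measures with
different dynamic ranges can be expected to need different learning rates, which is the
motivation for sweeping the rates in the experiments that follow.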

In each algorithm, the normalized learning and relevance rates were considered at
8 different levels, as presented in Table V-1. These settings provide various conditions
around the nominal LVQ settings, as described in Section 3.2.1.8. For RLVQ and
GRLVQ, each combination of learning and relevance rate was explored.

Table V-1: Learning and Relevance Rates for LVQ Algorithm Experiment.

Level   Learning Rate   Relevance Rate
1       0.0001          0.0001
2       0.001           0.001
3       0.01            0.01
4       0.1             0.1
5       1.0             1.0
6       10.0            10.0
7       100.0           100.0
8       1000.0          1000.0
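
The experimental grid implied by Table V-1 can be enumerated with a short sketch. The
variable names are hypothetical; the only facts taken from the text are the eight levels
and that every learning/relevance combination is explored for RLVQ and GRLVQ.

```python
from itertools import product

# The eight levels of Table V-1, shared by the learning and relevance rates.
LEVELS = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

# Algorithms without relevance learning sweep the learning rate only.
learning_only_settings = [(rate,) for rate in LEVELS]

# RLVQ and GRLVQ explore every (learning rate, relevance rate) combination.
relevance_settings = list(product(LEVELS, LEVELS))

print(len(learning_only_settings), len(relevance_settings))
```
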

5.3.3.1 Determining Suitable Distance Measures and LVQ Algorithm Settings


Figure V-7 presents results after formulating the LVQ cost functions, provided in
Appendix J, and computing performance results for each LVQ variation. As seen in
Figure V-7, only 5 LVQ distance measure variants achieve training or testing
classification accuracy above 40%. Squared Euclidean (the baseline), Cosine LVQ, and
Canberra LVQ consistently perform above 60% accuracy for learning rates above 0.1, and
thus these methods are explored further in other LVQ variations. While the Neyman χ² and
Sorensen LVQ variants achieve between 40% and 60% classification accuracy, they
perform much worse than Squared Euclidean, Cosine, and Canberra and are therefore
not considered further.


Figure V-7: Distance Measure Performance versus Learning Rate for LVQ
[x-axis: Learning Rate, 0.0001 to 1000.0; curves: training and testing accuracy for each of the nine distance measures]

5.3.3.2 Distance Measure Extensions to GLVQ


As discussed in Section 5.2.2.4, changing the distance measure in the cost
function for GLVQ involves merely changing the distance measure component of the
cost function derivative. This was considered for the Squared Euclidean (baseline),
Cosine, and Canberra measures. Figure V-8 presents classification results; best
performance is seen for learning rates of 0.01 and 0.1 for Squared Euclidean, above 1.0
for Cosine, and at 0.1 for Canberra. One could interpret this as indicating that Cosine
GLVQ needs a learning rate 10-100 times that of Squared Euclidean to achieve reasonable
performance.

Figure V-8: Distance Measure Performance versus Learning Rate for GLVQ
[x-axis: Learning Rate, 0.0001 to 1000.0; curves: training and testing accuracy for Squared Euclidean, Cosine, and Canberra]
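
The statement that only the distance component of the derivative changes can be seen
from the chain rule. The block below is a sketch using the standard GLVQ relative
distance difference, which is an assumption here; the revised metric of Section 5.2.2.3
may differ in form. The symbols d^{+} and d^{-} (distances to the closest correct and
incorrect PVs w^{+} and w^{-}), the sigmoid f, and the learning rate \epsilon are
notation introduced for this illustration.

```latex
\mu(\mathbf{x}) = \frac{d^{+} - d^{-}}{d^{+} + d^{-}}, \qquad
\frac{\partial \mu}{\partial d^{+}} = \frac{2\,d^{-}}{(d^{+} + d^{-})^{2}}, \qquad
\frac{\partial \mu}{\partial d^{-}} = \frac{-2\,d^{+}}{(d^{+} + d^{-})^{2}}

\Delta \mathbf{w}^{+} = -\epsilon\, f'(\mu)\,
    \frac{2\,d^{-}}{(d^{+} + d^{-})^{2}}\,
    \frac{\partial d^{+}}{\partial \mathbf{w}^{+}}, \qquad
\Delta \mathbf{w}^{-} = +\epsilon\, f'(\mu)\,
    \frac{2\,d^{+}}{(d^{+} + d^{-})^{2}}\,
    \frac{\partial d^{-}}{\partial \mathbf{w}^{-}}

% Squared Euclidean example of the only measure-specific factor:
\frac{\partial d}{\partial \mathbf{w}} = -2\,(\mathbf{x} - \mathbf{w})
```

In this form, substituting the Cosine or Canberra measure changes only the factor
\partial d / \partial \mathbf{w}; every other term is shared across the variants.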

5.3.4  Relevance  Learning  with  Alternative  LVQ  Distance  Measures 

Care must be taken when incorporating relevance learning in distance measures
since the relevance weighting must be relative to each feature. In RLVQ, the Euclidean
distance measure of (3.21) is formulated so that the relevance multiplier is easily
contained inside the summation. However, it is not always obvious where to
incorporate the relevance multiplier in other distance measures, such as the
Canberra and Cosine measures.

The Canberra measure consists of a summation of ratios; to ensure the
relevance values are associated with features and not PVs, the relevance values must
therefore be applied as a Hadamard product, e.g. [492], so that each feature is weighted
appropriately. Although Sorensen was not considered beyond LVQ, its formulation as a
ratio of sums would increase the difficulty of incorporating relevance learning. To
implement relevance learning, the relevance must be added so that it multiplies each
feature; for Canberra, the following relevance distance measure accomplishes this,


(5.13)


When considering the Cosine distance measure, one sees a ratio whose numerator is
a summation of products and whose denominator is a product of two summations.
To avoid an overly complicated derivative, the relevance multiplier was added only to
the numerator of the Cosine relevance equation.

After incorporating relevance learning into RLVQ, using the formulations
described in Appendix J and Figure V-3, each algorithm was considered for all relevance
rates in Table V-1 and for the learning rates associated with high accuracy (%C > 60%)
in Section 5.3.3.2. Classification results are presented in Figure V-9 through
Figure V-11, which show the relationship between learning rates and relevance rates for
the Squared Euclidean, Canberra, and Cosine RLVQ algorithms.
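
The placement of the relevance multiplier described above can be sketched in code. This
is an illustration built from the prose description only, not a transcription of the
relevance equations in this section: the function names, the symbol lam for the
per-feature relevance values, and the absolute values in the Canberra denominator are
assumptions.

```python
import math


def relevance_canberra(x, w, lam):
    """Canberra distance with one relevance weight per feature ratio."""
    total = 0.0
    for xi, wi, li in zip(x, w, lam):
        denom = abs(xi) + abs(wi)
        if denom > 0:
            total += li * abs(xi - wi) / denom
    return total


def relevance_cosine(x, w, lam):
    """Cosine measure with relevance applied to the numerator terms only."""
    numerator = sum(li * xi * wi for xi, wi, li in zip(x, w, lam))
    denominator = (math.sqrt(sum(xi * xi for xi in x))
                   * math.sqrt(sum(wi * wi for wi in w)))
    return numerator / denominator


x = [1.0, 2.0, 3.0]
w = [1.0, 2.0, 4.0]
print(relevance_canberra(x, w, [1.0, 1.0, 1.0]))  # all features weighted equally
print(relevance_canberra(x, w, [1.0, 1.0, 0.0]))  # third feature made irrelevant
print(relevance_cosine(x, w, [1.0, 1.0, 1.0]))
```

Setting a relevance value to zero removes that feature's contribution in both measures,
which is the per-feature behavior the weighting is meant to provide.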

Figure V-9 presents the relationship between classification accuracy, learning
rates, and relevance rates for Squared Euclidean RLVQ on the Fisher Iris dataset. Evident
in Figure V-9 is that the best performance is seen when the relevance rate is equal to or
less than the learning rate, consistent with [291]. Similarly, Figure V-10 presents
Canberra-RLVQ results, where the best performance is seen when the relevance rate is less
than the learning rate and particularly when the relevance rate is 0.01 or less.
Finally, Figure V-11 presents classification results for Cosine-RLVQ, wherein
the best performance is achieved only when the relevance rate is less than the
learning rate and valued at 0.0001 or 0.001.


Figure V-9: Learning Rate vs Relevance Learning Rate for Squared Euclidean RLVQ

Figure V-10: Learning Rate vs Relevance Learning Rate for Canberra RLVQ

Figure V-11: Learning Rate vs Relevance Learning Rate for Cosine RLVQ


5.3.5  Distance  Measure  Extensions  to  GRLVQ  and  GRLVQI 

To extend Canberra-GLVQ and Cosine-GLVQ to include relevance, the
considerations of the process in Figure V-3 and Figure V-4 were applied with the GLVQ
sigmoidal cost function and the revised relative distance difference metric of Section
5.2.2.3.

5.3.5.1 Relevance Learning and GRLVQ Extensions

When extending the distance measure formulations to GRLVQ, the considerations
described in Figure V-3 were followed, wherein the distance measure versions of GLVQ
were extended with relevance logic. Figure V-12 presents the relationship between
classification accuracy, learning rates, and relevance rates for Squared Euclidean GRLVQ
on the Fisher Iris dataset. Consistent with Squared Euclidean RLVQ in Section 5.3.4,
Figure V-12 shows that the best performance is seen when both the learning rate is
less than 1.0 and the relevance rate is less than the learning rate. Similarly, Figure V-13
presents Canberra-GRLVQ results, where the best performance is seen when the relevance
rate is less than the learning rate. Finally, Figure V-14 presents classification results for
Cosine-GRLVQ, wherein performance is consistent with Figure V-11, with the best
performance achieved only when the relevance rate is less than the learning rate and
valued at 0.0001 or 0.001.

Figure V-12: Learning Rate vs Relevance Learning Rate for Squared Euclidean GRLVQ.

Figure V-13: Learning Rate vs Relevance Learning Rate for Canberra GRLVQ.

Figure V-14: Learning Rate vs Relevance Learning Rate for Cosine GRLVQ.


5.3.5.2 Distance Measure Extensions to GRLVQI

The  extension  of  GRLVQ  to  GRLVQI  involves  components  unrelated  to  the 
distance  measure,  PV  gradient  descent  update  or  relevance  gradient  descent  update. 
Therefore,  algorithmically,  the  Cosine  and  Canberra  versions  of  GRLVQ  were  extended 
to  GRLVQI  by  incorporating  the  improvements  of  Section  5.2.2. 

5.4 GRLVQI-D Extension for RF-DNA Fingerprinting

To extend the discussions in Sections 5.3.3-5.3.5 to GRLVQI for RF-DNA
problems, two general aspects must be considered: 1) LVQ architecture selection and 2)
the interaction of the GRLVQI factors (learning, relevance, and conscience rates, and LVQ
architecture) with the resultant classification and verification performance. For LVQ
architecture selection, we will develop heuristics to determine the number of PVs to
instantiate and then consider the general impact of the number of PVs on Squared
Euclidean GRLVQI classification and verification performance. To understand the
interaction of these GRLVQI factors with performance, a full factorial ANOVA
experiment will be considered (using Z-Wave data), with response surface methods used
to find optimal settings. The algorithmic optimization approach is of particular interest
for the Cosine and Canberra GRLVQI algorithms since there are no prior
implementations of these from which to find reasonable settings.

5.4.1  LVQ  Architecture  Selection  and  Specification 

As noted in Section 3.3.1.8, the literature is largely silent on the appropriate
number of PVs, learning rates, and PV initialization process, except that one should use
as many PVs as possible [262] and that one needs at least one PV per class [299].
However, as seen in Schneider et al. [298], overfitting can occur in LVQ if too many PVs
are instantiated. Additionally, since each PV must be moved in an iterative fashion,
computation times necessarily increase when more PVs are considered. Therefore, one
should endeavor to instantiate a quantity of PVs that achieves good accuracy, avoids
overfitting, and is not computationally expensive.

LVQ overfitting issues appear similar to overfitting problems in ANNs, as
mentioned in [493]; since LVQ is also a neural learning algorithm, it could suffer from
similar problems. An example of the overfitting effect is presented in Table V-2, which
shows that an increasing number of ANN hidden nodes causes an increase in training
accuracy, while the resulting testing set accuracy does not similarly increase and instead
reaches a peak. LVQ architecture has similarities to ANNs, and hence appropriately
specifying the number of PVs could be critical to general LVQ performance. While the
number of nodes is frequently determined empirically, e.g. [494-498], approaches for ANN
architecture development exist and could be beneficial to LVQ algorithm performance.


Table V-2: Example of ANN Architecture Effects on ANN Performance, reproduced
from [493].

Input Nodes   Hidden Nodes   Output Nodes   Training Accuracy (%)   Testing Accuracy (%)
16            10             8              84.5                    58.5
16            13             8              89.2                    65.9
16            15             8              93.5                    73.2
16            18             8              93.7                    70.7
16            20             8              99.5                    73.2
16            25             8              100                     58.5

LVQ methods are considered to be generally robust to overfitting, as noted by
Biehl et al. [470] and attributed to the Hebbian learning results. However, Schneider et
al. [298] presented results showing that LVQ can overfit on some datasets.
Therefore, consideration of the appropriate number of PVs is important. To illustrate
the possibility of LVQ overfitting, an example is used. While the data examined
by Clark [493] is not available, other academic datasets are. For this, the small Insects
dataset is used; this dataset is from [499, 500] and consists of 3 data features, 3
classes, and 10 observations per class with no missing values. To examine potential
overfitting effects, one randomly selected observation from each class was sequestered
in a test set, and an LVQ network was trained with the remaining 27 observations. The
number of PVs per class was then increased from 1 to 9, with a constant learning rate of
c(t) = 0.1 used throughout and 600 randomly generated training iterations. Mean
test and training accuracy were then recorded over 100 replications. Table V-3 presents
the results and shows that LVQ can be susceptible to overfitting and that robustness to
overfitting is not universal for all LVQ algorithms in all applications.


Table V-3: Example of PV Architecture Effects on LVQ Performance on Insects.

Number of Input     Prototype Vectors   Training        Testing         Mean Computation
Nodes (Features)    (PVs) per Class     Accuracy (%)    Accuracy (%)    Time (s)
3                   1                   68.0            69.67           0.34
3                   2                   80.7            73.7            0.50
3                   3                   84.4            77.3            0.62
3                   4                   85.9            69.7            0.69
3                   5                   89.0            70.0            0.86
3                   6                   89.5            68.0            1.16
3                   7                   89.3            66.0            1.09
3                   8                   89.9            66.7            1.59
3                   9                   91.5            68.0            1.59

Beyond employing as many PVs as possible, as suggested by [262], which can
lead to overfitting as shown in Table V-3, the LVQ field is largely bereft of
literature on the number of PVs to initialize. However, the ANN field is replete with
literature on appropriately selecting the number of hidden nodes in model
development, including heuristic approaches [304, 501] and algorithmic approaches
[502-504]. Of interest is whether neural network heuristics for the number of hidden
nodes can be extended to specifying the number of LVQ PVs.

5.4.1.1 Extending ANN Architecture Heuristics to LVQ

Lv et al. [253] considered 1 PV per class; for RF-DNA, Reising [51] used 10 PVs
per class; and for hyperspectral target detection, Mendenhall [244] used 5 PVs per
class. While 1 PV per class is a minimum requirement for LVQ algorithms [299], and
permits initializing each PV to the centroid (arithmetic mean) of its respective group as
an easy and logical solution to the initialization problem, using too few PVs can yield
poor results, as empirically demonstrated in the academic example in Table V-3.

Although Mendenhall [244] mentioned using heuristics to determine the number
of PVs for GRLVQI, they were not formalized for the family of LVQ algorithms
considered. However, Gage [304] investigated and developed ANN architecture
approaches where the size of the hidden layer was dependent on the number of inputs,
number of exemplars, hidden layer weights, and/or the number of neurons at each layer.
Although LVQ algorithms are ANNs, a few difficulties exist in extending general ANN
methods to LVQ: firstly, the general LVQ architecture is not identical to ANN
architecture, as described in Section 3.2; secondly, LVQ requires PVs to be designated to
a class; and finally, LVQ does not have output nodes as seen in an ANN. Despite these
differences, some empirical formulas for ANN architecture specification could be
applicable to LVQ architecture specification.

Many heuristics considered by Gage [304] involve the number of input
features, N_F, the number of exemplars, N_obs, and the number of output layer nodes, N_out.
Extending this to LVQ would see N_F being the number of input features and N_PV
representing the number of PVs; since LVQ does not have an output layer, one could
interpret N_out in one of two ways: A) as nothing, in which case N_out would be treated
as a constant 1 (thus N_out is equivalent to N_PV since N_PV is effectively the output
layer in LVQ models), or B) as the number of classes, consistent with [274].

Basic neural network heuristics include the following general advice, that

$N_{PV,\mathrm{Looney1}} = a \cdot N_C$    (5.15)

where a is a constant and N_C is the number of classes [250, p. 101]. While this is
certainly suitable for LVQ architectures due to their underlying assumptions, it is not
helpful in determining N_PV and only provides the obvious lower bound of N_PV = N_C
for a = 1. However, an extension of this approach is seen in

$N_{PV,\mathrm{WC}} = a \cdot N_F$    (5.16)

where a is used as a fraction [505]. In this form, a has variously been recommended as
either 0.75 [506, 507] or 0.50 [508].

Looney [250, p. 91] presented another general heuristic of

$N_{PV,\mathrm{Looney2}} = \log_2 N_C$    (5.17)

where N_C is the number of classes in the dataset; since this quantity will yield
N_PV,Looney2 < N_C PVs, it is not appropriate for LVQ models. Similar is the empirically
determined approach of Gorman and Sejnowski [494], noted as an effective heuristic for
ANNs [509],

$N_{PV,\mathrm{Gorman}} = \log_2 T$    (5.18)

where T is the number of input training patterns; however, this terminology can be
interpreted variously (depending on what one means by “pattern”) as either

$N_{PV,\mathrm{Gorman1}} = \log_2 N_F$    (5.19)

or

$N_{PV,\mathrm{Gorman2}} = \log_2 N_{obs}$    (5.20)

or possibly

$N_{PV,\mathrm{Gorman3}} = \log_2 (N_C \cdot N_{obs}).$    (5.21)

Additional heuristics include one from Hayashi [250, p. 316; 510],

$N_{PV,\mathrm{Hay}} = q \sqrt{N_{out} \cdot N_F}$    (5.22)

where q is a multiplier constant, set to 1 herein. Walczak and Cerpa [505] presented a
heuristic based on [496, 511] that

$N_{PV,\mathrm{Kur}} = 2 N_F + 1.$    (5.23)

Gao et al. [501] presented the following heuristic,

$N_{PV,\mathrm{Gao}} = \sqrt{N_{out} \cdot N_F} + q,$    (5.24)

with q being a constant between 1 and 10, and attributed it to [503]. Daqi and Shouyi
[512] present the following heuristic,

$N_{PV,\mathrm{Daqi}} = \sqrt{(N_{out} + 2) \cdot N_F} + 1.$    (5.25)

Gage [304] presents a heuristic termed “Cover’s theorem”,

$N_{PV,\mathrm{Gage}} < \frac{0.5 N_{obs} - 1}{N_F + 1}$    (5.26)

which considers the number of exemplars, N_obs, and data features, N_F [304, 501].

5.4.1.2 Developing LVQ Architecture Heuristics

Considering the heuristics in Section 5.4.1.1, the GRLVQI settings of [48, 247],
and the absolute minimum of N_PV = 1 per class for the ZigBee RF-DNA data under
analysis (N_C = 4, N_F = 729, N_obs = 1500), one arrives at Table V-4. Results for both
N_out = 1 and N_out = N_C are computed.


Table V-4: #PVs for RF-DNA Using Various Heuristics for ZigBee Data.

Origination   Heuristic          N_PV, N_out = 1     N_PV,          Interpreted
                                 (N_out ignored)     N_out = N_C    N_PV/N_C
ANNs          N_PV,Kur           *                   *              1459†
              N_PV,Looney1       *                   *              4
              N_PV,Looney2       *                   *              2
              N_PV,Gage          *                   *              1
              N_PV,Gorman1       *                   *              10
              N_PV,Gorman2       *                   *              11
              N_PV,Gorman3       *                   *              13
              N_PV,Gao           28                  55             14, 28
              N_PV,Hay           27                  54             14, 27
              N_PV,Daqi          48                  68             17, 48
              N_PV,WC            *                   *              365-547†
LVQ           N_PV,Min           *                   *              1
              N_PV,Mendenhall    *                   *              5
              N_PV,Reising       *                   *              10

* indicates the heuristic is not a function of N_out, and hence this quantity is not computed
† indicates obviously unreasonable values for N_PV

Based on the results presented in Reising [51] and Dubendorfer [91], both for N_PV
= 10 per class for RF-DNA fingerprints, we can safely exclude the numbers of PVs
suggested by N_PV,WC and N_PV,Kur as considerably too many. However, the remaining
heuristics suggest numbers of PVs that appear reasonable.
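
The entries of Table V-4 can be checked with a few lines of arithmetic. The sketch below
evaluates heuristics (5.15) through (5.26) for the ZigBee parameters. The function name
pv_heuristics is hypothetical, and the rounding rules are assumptions chosen because they
agree with the tabulated values: fractional results are rounded up to whole PVs, except
the Gage bound, which is rounded down because it is an upper limit.

```python
import math


def pv_heuristics(n_c, n_f, n_obs, n_out, a=1.0, q=1.0):
    """Evaluate the N_PV heuristics for one dataset and one choice of N_out."""
    up = math.ceil  # assumption: fractional PV counts are rounded up
    return {
        "Looney1": up(a * n_c),                              # (5.15)
        "WC": (up(0.50 * n_f), up(0.75 * n_f)),              # (5.16), a = 0.50 to 0.75
        "Looney2": up(math.log2(n_c)),                       # (5.17)
        "Gorman1": up(math.log2(n_f)),                       # (5.19)
        "Gorman2": up(math.log2(n_obs)),                     # (5.20)
        "Gorman3": up(math.log2(n_c * n_obs)),               # (5.21)
        "Hay": up(q * math.sqrt(n_out * n_f)),               # (5.22)
        "Kur": 2 * n_f + 1,                                  # (5.23)
        "Gao": up(math.sqrt(n_out * n_f) + q),               # (5.24)
        "Daqi": up(math.sqrt((n_out + 2) * n_f) + 1),        # (5.25)
        "Gage": math.floor((0.5 * n_obs - 1) / (n_f + 1)),   # (5.26), upper bound
    }


zigbee = dict(n_c=4, n_f=729, n_obs=1500)
ignored = pv_heuristics(**zigbee, n_out=1)      # N_out treated as the constant 1
by_class = pv_heuristics(**zigbee, n_out=4)     # N_out interpreted as N_C

for name in ("Gao", "Hay", "Daqi"):
    # Totals computed with N_out = N_C are divided by N_C to obtain PVs per class.
    per_class = math.ceil(by_class[name] / zigbee["n_c"])
    print(name, ignored[name], by_class[name], per_class)
```

Only the Gao, Hay, and Daqi heuristics depend on N_out, which is why they are the only
rows of Table V-4 with two candidate per-class values.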

A vector of quantities of PVs (per class) to consider was formulated as:

N_PV = [1, 5, 7, 8, 9, 10, 11, 12, 13, 15, 20, 27, 37, 48].  (5.27)

N_PV = [7, 8, 9, 11, 12, 13] per class were also considered in order to search for
suitable operating points across the heuristic space and around the nominal setting of
10 N_PV/N_C. Values of 14 and 17 N_PV/N_C were not considered, since these are close to
15 N_PV/N_C, to avoid superfluous computational runs. Values above 48 N_PV/N_C were not
initially considered due to the extra computation time required; these would only be
considered if the results indicated a potential utility in exploring these settings.

Figure V-15 considers GRLVQI results on the ZigBee dataset at 14 dB for the
N_PV values in (5.27). The preliminary results in Figure V-15 show that overfitting would
be an issue if too many PVs were instantiated (N_PV > 20) and that poor accuracy would
result if too few PVs were instantiated (N_PV < 9). From Figure V-15, N_PV = 13 offers
overall higher training, testing, and validation accuracy than N_PV = 10; additionally, the
differences among training, testing, and validation accuracy are small
when compared to N_PV > 15.

Figure V-15: GRLVQI Classification Results on ZigBee RF-DNA Fingerprints at 14
dB Using Various PVs/class.

Figure V-16 presents classification performance results from considering Squared
Euclidean GRLVQI for N_PV = [10, 13] with the ZigBee RF-DNA Fingerprints. As seen
in Figure V-16, classification performance appears comparable for SNR > 10 dB, with
N_PV = 13 offering a slight improvement in gain of +0.41 dB (training) and +0.51 dB
(testing) at 90% accuracy. However, classification performance appears markedly improved
for low SNR: between 5 dB and 10 dB, GRLVQI with N_PV = 13 offers a gain of +1.85 dB
(training) and +2.27 dB (testing) at 70% accuracy.
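
The gain figures quoted above compare the SNR at which two accuracy curves reach the same accuracy level. A minimal sketch of that computation follows, assuming linear interpolation between measured SNR points; the function names and the sample curves are hypothetical and are not taken from the thesis data.

```python
def snr_at_accuracy(snr_db, accuracy, target):
    """Lowest SNR (dB) at which a curve first reaches `target`, by linear interpolation."""
    if accuracy[0] >= target:
        return snr_db[0]
    points = list(zip(snr_db, accuracy))
    for (s0, a0), (s1, a1) in zip(points, points[1:]):
        if a0 < target <= a1:
            return s0 + (target - a0) * (s1 - s0) / (a1 - a0)
    raise ValueError("target accuracy is never reached")


def snr_gain(snr_db, acc_reference, acc_candidate, target):
    """Gain (dB): reduction in SNR the candidate needs to reach `target` accuracy."""
    return (snr_at_accuracy(snr_db, acc_reference, target)
            - snr_at_accuracy(snr_db, acc_candidate, target))


# Hypothetical accuracy curves (percent correct); illustrative values only.
snr = [0, 4, 8, 12, 16, 20]
acc_10pv = [20.0, 35.0, 60.0, 85.0, 95.0, 99.0]
acc_13pv = [22.0, 40.0, 68.0, 88.0, 96.0, 99.0]
gain_90 = snr_gain(snr, acc_10pv, acc_13pv, 90.0)
```

A positive gain indicates that the candidate curve reaches the target accuracy at a lower SNR than the reference curve.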

Figure V-16: GRLVQI Classification Performance with 10 PVs versus 13 PVs.

When considering verification accuracy with Squared Euclidean GRLVQI using
N_PV = 13, one can see in Figure V-17 to Figure V-19 that more structure is seen when
compared to the verification results seen in Section III for N_PV = 10.

[Figure panels: a) Authorized, True Verification Rate (TVR) versus False Verification Rate (FVR); b) Rogue, True Verification Rate (TVR) versus Rogue Accept Rate (RAR).]

Figure  V-17:  Verification  Performance  in  GRLVQI  with  13  PVs  at  8dB. 

Figure V-18: Verification Performance in GRLVQI with 13 PVs at 14dB.

Figure V-19: Verification Performance in GRLVQI with 13 PVs at 18dB.


Table V-5 presents an overall comparison of classification and verification
performance for Squared Euclidean GRLVQI with N_PV = [10, 13]. Overall, classification
performance is largely improved with 13 PVs, while verification performance is greatly
improved for low SNR and slightly worse for higher SNR. One can therefore conclude
that 13 PVs offers measurable performance improvements over 10 PVs. However,
possible changes to learning, relevance and conscience rates have not been considered.

Table V-5: Relationship between PVs and Classification/Verification Performance

        Classification Performance              Verification Performance (%Authorized or %Rogue Rejected)
        SNR (dB) at 90%C                        at 8dB               at 14dB              at 18dB
N_PV    TNG      TST      AUCC_TNG  AUCC_TST    Authorized  Rogue    Authorized  Rogue    Authorized  Rogue
10      12.92    12.39    24.99     25.24       0%          0%       25%         47.22%   25%         63.88%
13      12.51    11.88    25.27     25.51       25%         22.22%   25%         50%      25%         52.78%

5.4.2 Experimental Design for GRLVQI-D Algorithmic Settings

Employing experimental designs to find optimal algorithm settings has been seen
in hyperspectral anomaly detection research, c.f. [513-520], but not in prior RF-INT
efforts. Herein, however, determining appropriate algorithmic settings is of prime
interest since neither Cosine GRLVQI nor Canberra GRLVQI algorithms have been
previously developed or applied to RF-DNA problems. Therefore, it is unknown what
settings are appropriate for these algorithms.

Following the discussions in Sections 5.3.3-5.3.5, a few observations can be
made: 1) Cosine and Squared Euclidean variants of LVQ, RLVQ, GLVQ and
GRLVQ perform similarly well in classification of Fisher Iris; 2) Cosine LVQ
variants perform best with both a learning rate 10 times or greater and a relevance rate
1/10 to 1/100 of that seen in Squared Euclidean LVQ variants; and 3) Canberra
variants similarly performed best with a learning rate 10 times or greater than
Squared Euclidean, but appeared invariant to relevance learning rate. Additionally, in
Section 5.4.1, we learned that changing the number of PVs can significantly impact
GRLVQI performance.

5.4.2.1 Full Factorial Model

To determine optimal settings for Squared Euclidean GRLVQI, Cosine GRLVQI,
and Canberra GRLVQI algorithms, a full factorial experiment was considered. Table
V-6 presents the 3^5 design wherein the middle (0) design settings are those employed by
Reising [51], and the high and low settings for learning and relevance rates are an order of
magnitude above and below, respectively, the middle settings per the observations in Sections
5.3.3-5.3.5. Two conscience rates are present in GRLVQI and the scale of these differs
from the learning and relevance rates; Table III-3 presented training steps and
corresponding explored conscience rates where γ is seen to be initialized as 2.0 and reach
an absolute minimum (after many training steps) of 0.75, and β is initialized as 0.35 and
reaches an absolute minimum of 0.10. To account for this range and explore other possible
good settings, the full factorial experiment explores a low setting of 0.5 and a high setting
of 4.5 for γ and a low setting of 0.15 and a high setting of 0.55 for β. Additionally, the
number of PVs is considered as a fifth factor where 13 PVs per class is considered as the
high value and 7 PVs per class is considered as the low value, per the discussion in Section 5.4.1.


Table V-6: Experimental Design Region for GRLVQI.

                Factor A     Factor B     Factor C      Factor D      Factor E
Factor Level    Learning     Relevance    Conscience    Conscience    N_PV
                Rate (ε)     Rate (ξ)     Rate 1 (γ)    Rate 2 (β)
Low (-)         0.0025       0.0005       0.5           0.15          7
Middle (0)      0.025        0.005        2.0           0.35          10
High (+)        0.25         0.05         4.5           0.55          13

Employing the settings from Table V-6 yields a total of 3^5 = 243 different setting
combinations per GRLVQI-D variant. To consider all of these possible operating points,
Z-Wave RF-DNA data, as described in Section III and employed in [49], was used due to
the much smaller size of this data set and its signal similarity to ZigBee. Appendix K
presents the Z-Wave experimental results (mean training and testing AUCC along with
mean verification AUC values) for the Cosine, Canberra and Squared Euclidean
GRLVQI algorithms, grouped by distance measure, for the experimental design in Table
V-6. To expedite the computational process, the baseline Squared Euclidean GRLVQI
algorithm employed MATLAB compiled C-code (*.mex) files, which were compiled via
the approach in Appendix L.
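
As an illustration of how the 243 operating points arise, the following sketch enumerates every combination of the three levels of the five factors in Table V-6. The dictionary keys are hypothetical names chosen for readability; only the numeric levels come from the table.

```python
from itertools import product

# Factor levels (low, middle, high) as listed in Table V-6.
levels = {
    "learning_rate": (0.0025, 0.025, 0.25),
    "relevance_rate": (0.0005, 0.005, 0.05),
    "conscience_rate_1": (0.5, 2.0, 4.5),
    "conscience_rate_2": (0.15, 0.35, 0.55),
    "n_pv": (7, 10, 13),
}

# Full factorial design: one dictionary of settings per experimental run.
design = [dict(zip(levels, combo)) for combo in product(*levels.values())]

# The all-middle run corresponds to the baseline settings.
baseline = {name: values[1] for name, values in levels.items()}
```

Each entry of `design` would then be passed to one training run per GRLVQI-D variant.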


5.4.2.2 Response Surface Methodology

After the experimental runs in Section 5.4.2.1 were complete, a second order
model with squared terms and two-way interactions was considered:

\[ f(x) = B_0 + \sum_{i=1}^{s} B_i x_i + \sum_{i=1}^{s-1} \sum_{j=i+1}^{s} B_{ij} x_i x_j + \sum_{i=1}^{s} B_{ii} x_i^2 \tag{5.28} \]


where s represents the number of factors, the B terms are coefficients solved for via a general
linear model, and x_i represents a given factor [513]. Two initial second order models
were created per algorithm with all parameters and interactions (termed “Full Model”)
by applying (5.28) with either classification (mean AUCC) or verification (mean AUC)
accuracy as the dependent variable. All models were statistically significant using α =
0.05, but not all features and interactions were significant. Reduced models were therefore
created, each a second order model that contained only the main effects (factors in
Table V-6, whether or not significant) and any significant second order effects. Table V-7
presents an overview of the second order models by reporting R^2 and adjusted R^2 values
for both the full and reduced models.
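
To make the model form in (5.28) concrete, the sketch below fits a full second order model by ordinary least squares and reports R^2 and adjusted R^2. It is a minimal pure-Python illustration under stated assumptions: the two-factor coded design (-1, 0, +1), the noiseless response, and all function names are hypothetical, and the statistical package used in the original work is not assumed.

```python
from itertools import combinations


def quadratic_terms(x):
    """Expand factors x into [1, x_i ..., x_i*x_j (i<j) ..., x_i^2 ...]."""
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return [1.0] + list(x) + cross + [v * v for v in x]


def solve(a, b):
    """Solve a * beta = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_second_order(xs, ys):
    """Least-squares fit via the normal equations; returns (coefficients, R^2, adjusted R^2)."""
    rows = [quadratic_terms(x) for x in xs]
    p = len(rows[0])  # number of model terms, including the intercept
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * t for b, t in zip(beta, r)) for r in rows]
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    n = len(ys)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p)
    return beta, r2, adj_r2


# Hypothetical two-factor, three-level coded design with a noiseless response.
xs = [(a, b) for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]
ys = [0.9 + 0.05 * a - 0.02 * b + 0.01 * a * b - 0.10 * a * a - 0.03 * b * b
      for a, b in xs]
beta, r2, adj_r2 = fit_second_order(xs, ys)
```

A reduced model in the sense used here would be obtained by dropping the non-significant interaction and squared columns from `quadratic_terms` and refitting.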

As seen in Table V-7, the classification models from Squared Euclidean data
explain a significant amount of variance in the data, while the verification based models
do not explain much variation. When considering the Cosine GRLVQI models, both the
classification and verification models explain most of the variation in the data; however,
neither the classification nor the verification based models explain much variation when
considering the Canberra GRLVQI results.


Table V-7: Overview of Second Order Models.

                               Squared Euclidean    Cosine               Canberra
Algorithm                      GRLVQI               GRLVQI-D             GRLVQI-D
Dependent Variable             Class.    Ver.       Class.    Ver.       Class.    Ver.
Full Model      R^2            0.900     0.246      0.942     0.829      0.259     0.408
                R^2 Adjusted   0.891     0.178      0.937     0.814      0.193     0.355
Reduced Model   R^2            0.898     0.241      0.938     0.824      0.215     0.399
                R^2 Adjusted   0.892     0.195      0.936     0.817      0.188     0.378

Table V-8 presents variables that were deemed statistically significant in the full
model. Again, in all reduced models main effects were included for completeness. In
Table V-8, an “X” indicates that a variable is statistically significant at α = 0.05, while a
“?” indicates that a variable has a p-value between 0.05 and 0.10, which should be
considered as statistically significant at α = 0.05, per [369]. As seen in Table V-7, the R^2
and adjusted R^2 are largely unchanged when considering the reduced models, indicating
that the removed features were not explaining much variation in the data.

Table V-8: Features Significant Per Model.


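Model columns: Squared Euclidean GRLVQI (Class., Ver.); Cosine GRLVQI-D (Class., Ver.);
Canberra GRLVQI-D (Class., Ver.)

[Marks are listed per feature in reading order; the column in which each mark fell is not recoverable.]

Feature       Marks
ε             X X X X X
ξ             X X
γ             X
β             X X
N_PV          X X X X X X
ε^2           X X X X X
ξ^2           X X
γ^2           X X
β^2           X
N_PV^2        X X X X X
ε × γ         X ?
ε × β
ε × N_PV      X X
ξ × γ
ξ × β
ξ × N_PV
γ × β
γ × N_PV      ?
β × N_PV      X X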

5.4.2.3 Setting Optimization

As mentioned in Chapter 3, determining appropriate settings for LVQ algorithms
is a largely untouched domain; however, after finding reduced second order models, one
can solve for optimal algorithmic settings where the targets are the dependent variables
(either mean classification or mean verification accuracy). Determining appropriate
settings is of critical importance for distance measure variants of GRLVQI since these
have unknown operating characteristics.

Constrained nonlinear optimization, or interior point optimization, consistent with
[521-523] was used to maximize the final, reduced, second order models. A constrained
minimization (where the target accuracies were negated, since maximization is
achieved by minimizing the negation) was considered where a finite-difference
approximation was computed by starting with an initial estimate (the baseline GRLVQI
settings). The minimization was constrained between the minimum and maximum values
seen in Table V-6 to avoid computing values outside those explored (e.g., when
considered, unbounded optimization yielded settings far outside the design space, with
magnitudes ranging from 10^13 to 10^42). The optimal solution was then computed via
sequential quadratic programming (SQP) [524, 525] wherein a line search was employed,
consistent with [524-526].
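
The role of the bounds can be illustrated with a small sketch. Note that this is not SQP: projected gradient ascent with central finite-difference gradients is swapped in here as a simple stand-in, and the two-factor model, its coefficients, and the function names are hypothetical. The point shown is only that a fitted second order model whose unconstrained optimum lies outside the design region is driven to the boundary of that region.

```python
def finite_difference_gradient(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def maximize_in_box(f, x0, bounds, step=0.1, iterations=500):
    """Gradient ascent in which every iterate is clipped back into the bounds."""
    x = list(x0)
    for _ in range(iterations):
        grad = finite_difference_gradient(f, x)
        x = [min(max(xi + step * gi, lo), hi)
             for xi, gi, (lo, hi) in zip(x, grad, bounds)]
    return x


# Hypothetical reduced second order model in two coded factors on [-1, 1].
def model(x):
    a, b = x
    return 0.9 + 0.05 * a + 0.20 * b - 0.10 * a * a - 0.03 * b * b


# Start from the middle (baseline) setting, as in the procedure described above.
best = maximize_in_box(model, x0=[0.0, 0.0], bounds=[(-1.0, 1.0), (-1.0, 1.0)])
```

In this example the first factor settles at its interior stationary point (0.25), while the second is held at its upper bound because the unconstrained stationary point (about 3.33) lies outside the explored region.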

Resultant optimal algorithmic settings for each factor are presented in Table V-9.
Here, settings are grouped in pairs of rows by algorithm and then by whether mean
classification AUCC or mean verification AUC was used as the target. Evident in Table
V-9 is that only N_PV = 7 was consistently found as optimal between algorithms.
Otherwise, most factors had different optimal algorithmic settings. Additionally, all
optimal settings were different from the baseline settings as employed by [51].


Table V-9: Optimized Algorithms Settings for Z-Wave Data.

                               Factor A    Factor B     Factor C      Factor D      Factor E
Algorithm                      Learning    Relevance    Conscience    Conscience    N_PV
                               Rate (ε)    Rate (ξ)     Rate 1 (γ)    Rate 2 (β)
Squared Euclidean    Class.    0.1497      0.0005       4.5           0.3128        7
GRLVQI               Ver.      0.1481      0.05         0.5           0.15          7
Cosine GRLVQI-D      Class.    0.1376      [missing]    4.5           0.55          7
                     Ver.      0.135       0.0005       0.5016        0.15          7
Canberra GRLVQI-D    Class.    0.25        0.032        0.5           0.15          7
                     Ver.      0.25        0.032        0.5           0.15          7
Baseline             -         0.025       0.005        2.0           0.35          10

5.4.3 GRLVQI-D Performance Results

Classification and verification performance can be considered using the optimized
algorithmic settings. Z-Wave classification performance will be considered relative to
the baseline classifier settings of Reising [51]. Three sets of classification results are
considered in Figure V-20 through Figure V-24. Figure V-20 presents training (TNG)
and testing (TST) classification results from the baseline Squared Euclidean GRLVQI
algorithm, the Squared Euclidean GRLVQI algorithm using the Classification-based
optimized settings in Table V-9, and the Squared Euclidean GRLVQI algorithm using the
Verification-based optimized settings in Table V-9. Noticeably, classification
performance appears markedly improved when using either optimized setting; both also
have fewer PVs, N_PV = 7, and thus are computationally simpler algorithms. The
Classification-based optimized Squared Euclidean GRLVQI shows an improvement in
gain of +1.98 dB (TNG) and +1.94 dB (TST) at 90% accuracy; the Verification-based
optimized Squared Euclidean GRLVQI shows an improvement in gain of +1.31 dB
(TNG) and +1.48 dB (TST).


Figure  V-20:  GRLVQI  Classification  Performance  Using  Squared  Euclidean 
Distance  Using  Optimized  Algorithmic  Settings. 

Figure V-21 presents the verification accuracy of both optimized Squared
Euclidean GRLVQI algorithms; one can see that the Classification-based Squared
Euclidean GRLVQI in Figure V-21a offers 100% verification accuracy, which improves
upon the 33.33% verification accuracy of the baseline, reported in Section III and [49].
Additionally, the mean verification AUC of 0.9707 is slightly higher than the mean AUC
of the baseline, 0.9363. When considering the Verification-based Squared Euclidean
GRLVQI performance in Figure V-21b, the performance is noticeably poor, with no
devices authorized at 10% FVR and 90% TVR. Additionally, the curves in Figure V-21b
are significantly worse than baseline with a mean AUC of 0.5916. Overall, it is evident
that the optimized settings from the Classification-based Squared Euclidean GRLVQI
offer improved performance over baseline, while using the optimized settings from the
Verification-based Squared Euclidean GRLVQI classifier offers comparatively
poor verification performance.


[Figure panels: a) Classification-Based Optimization; b) Verification-Based Optimization. Horizontal axis: False Verification Rate (FVR). Legend: Authorized (90%) at 10% FPF; Rejected (90%) at 10% FPF.]


Figure  V-21:  GRLVQI  ID  Verification  Performance  in  Squared  Euclidean 
GRLVQI  using  Optimization  Settings  at  20dB  for  Z-Wave  Dataset. 


Classification results from the Canberra GRLVQI-D classifier are presented in
Figure V-22 for the Classification-based and Verification-based optimized settings with
the Z-Wave data. The performance of both is dramatically below the baseline Squared
Euclidean GRLVQI algorithm. Figure V-23 presents the verification accuracy of both
optimized Canberra GRLVQI-D algorithms; one can see that neither the Classification-based
Canberra GRLVQI-D in Figure V-23a nor the Verification-based Canberra GRLVQI-D
in Figure V-23b performs well. Additionally, the curves in Figure V-23 are significantly
worse than baseline, with mean AUC values of 0.740 and 0.560, respectively (Table V-10).
Overall, it is evident that Canberra GRLVQI-D, at least with the considered settings,
appears unsuitable for RF-DNA applications.

Figure V-22: GRLVQI-D Classification Performance Using Canberra Distance
Using Optimized Algorithmic Settings.

Figure V-23: GRLVQI-D ID Verification Performance in Canberra GRLVQI using
Optimization Settings at 20dB for Z-Wave Dataset.


Classification results from the Cosine GRLVQI-D classifier are presented in
Figure V-24 for the Classification-based and Verification-based optimized settings with
the Z-Wave data. In contrast to the Canberra GRLVQI-D algorithms of Figure V-22, the
Cosine GRLVQI-D classification results offer improved performance over the baseline
Squared Euclidean GRLVQI algorithm. The Classification-based optimized Cosine
GRLVQI-D shows an improvement in gain of +1.57 dB (TNG) and +1.91 dB
(TST) at 90% accuracy; the Verification-based optimized Cosine GRLVQI-D
shows an improvement in gain of +1.67 dB (TNG) and +1.84 dB (TST). Performance is
thus similar to the optimized Squared Euclidean GRLVQI algorithm in Figure V-20.

Figure V-24: GRLVQI Classification Performance Using Cosine Distance Using
Optimized Algorithmic Settings.


Figure V-25 presents the verification accuracy of both optimized Cosine
GRLVQI-D algorithms; one can see that both the Classification-optimized Cosine
GRLVQI-D in Figure V-25a and the Verification-optimized Cosine GRLVQI-D in
Figure V-25b offer 66.67% verification accuracy, which improves upon the 33.33%
verification accuracy of the baseline, as reported in Section III and [49], but is worse
than the 100% verification accuracy of the Classification-optimized Squared Euclidean
GRLVQI algorithm in Figure V-21a. However, the mean verification AUC of both Cosine
GRLVQI-D variants is 0.9712, which is effectively equivalent to the mean verification AUC of
0.9707 for the Classification-optimized Squared Euclidean GRLVQI algorithm.


[Figure panels: a) Classification-Based Optimization; b) Verification-Based Optimization. Horizontal axis: False Verification Rate (FVR).]

Figure  V-25:  GRLVQI  ID  Verification  Performance  in  Cosine  GRLVQI  using 
Optimization  Settings  at  20dB  for  Z-Wave  Dataset. 


Table V-10 presents an overall comparison of classification and verification
performance for the Squared Euclidean GRLVQI algorithm, the Cosine GRLVQI-D
algorithm, and the Canberra GRLVQI-D algorithm. Overall, classification performance
is largely improved over baseline when using either the optimized (either Classification
or Verification based) Cosine GRLVQI-D algorithm or the Classification-optimized
Squared Euclidean GRLVQI algorithm. Canberra GRLVQI-D offers no performance
benefits and thus it is not further considered.

Table V-10: Z-Wave Optimized Algorithms Results for Z-Wave Data.

                                   Classification                                Verification at 20 dB
                                                        SNR Gain (dB) Relative
                                                        to TST Baseline at 90 %C
Algorithm                          RAP_TNG    RAP_TST   TNG        TST           % Auth.    Mean AUC
Squared Euclidean  None (Baseline) 1.01       1.00      +0.4       0.00          33.33%     0.936
GRLVQI             Class.          1.06       1.06      +0.44      +1.94         100%       0.971
                   Ver.            1.03       1.01      +0.23      +1.48         0%         0.592
Cosine GRLVQI-D    Class.          1.03       1.01      +0.06      +1.91         66.67%     0.971
                   Ver.            1.02       1.03      +0.23      +1.84         66.67%     0.971
Canberra GRLVQI-D  Class.          0.58       0.54      N/A        N/A           0%         0.740
                   Ver.            0.58       0.53      N/A        N/A           0%         0.560

Appendix M presents an extension of the Z-Wave optimized GRLVQI and
GRLVQI-D algorithms with ZigBee data. While the optimized algorithms improve
performance for Z-Wave data, the results in Appendix M illustrate the difficulty in
applying optimized settings from one dataset to a different dataset. Thus, if ZigBee
devices are of specific interest, one would desire to optimize GRLVQI and GRLVQI-D
algorithmic settings for these devices.

5.4.4  Results  Interpretation 

Overall, the process and methodology presented in this chapter enable one to
create distance measure variants of LVQ algorithms including LVQ, RLVQ, GLVQ,
GRLVQ, and GRLVQI. The derivative skeleton presented in Section 5.2.2.4 further
enables one to make any reasonable change to the cost function of the LVQ family of
algorithms. The N_PV heuristics and optimization scheme provide a further approach for
selecting reasonable settings for these algorithms.

Optimization of GRLVQI was considered using both Classification accuracy and
Verification accuracy as an objective. Z-Wave data was employed due to the smaller size
of the dataset and the requirement for a multitude of algorithmic runs, as seen in Appendix
K. When optimized settings were considered and evaluated on Z-Wave data, both the
Classification-optimized Squared Euclidean GRLVQI algorithm and the Classification
and Verification-optimized Cosine GRLVQI-D algorithms offered better performance
than the baseline settings of [51]. The results for both the optimized Squared Euclidean
GRLVQI and optimized Cosine GRLVQI-D algorithms are reasonable and hence the
optimization method and process show efficacy for finding robust points when other
devices are under analysis, and for recommending new operating points for either new
algorithms, such as Cosine GRLVQI-D, or new signal modalities.

VI. Improvements to the RF-DNA Fingerprinting Process


Adorn  thyself  with  simplicity 

-Marcus  Aurelius,  121-180 

In operation, as described in Chapter II, the AFIT RF-DNA process consists of
two main elements: signal collection (accomplished using various signal
collection equipment) and post-collection processing (accomplished using software).
After collection, the data is digitally filtered and processed to create samples at various
desired analysis SNR levels. Subsequently, RF-DNA fingerprints are computed and
various device classification schemes applied for model development. In computing
RF-DNA fingerprints, as described in Section 2.4, the signal Region Of Interest (ROI) is
divided into multiple subregions (N_R total), each with N_s time samples per subregion. In
each subregion, mathematical moments of mean (μ), variance (σ^2), skewness (γ), and
kurtosis (κ), using (2.9), (2.6)-(2.8) respectively, are computed to provide insight into the
distribution shape about its mean. Of interest in this chapter are potential improvements
that can be made to the RF-DNA Fingerprinting process by leveraging research and
methods in statistical data analysis and simulation studies.
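
The subregion statistics described above can be sketched as follows. This is an illustrative simplification only: equations (2.6)-(2.9) are not reproduced in this section, so the population (1/N) moment definitions, the non-excess kurtosis, the equal-length contiguous subregions, and the function names are all assumptions.

```python
def moments(samples):
    """Mean, variance, skewness, and kurtosis of one subregion (population form)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    # A constant subregion (var == 0) would need special handling; omitted here.
    skew = sum((s - mean) ** 3 for s in samples) / n / var ** 1.5
    kurt = sum((s - mean) ** 4 for s in samples) / n / var ** 2
    return [mean, var, skew, kurt]


def fingerprint(roi, n_regions):
    """Concatenate the four statistics over n_regions equal-length subregions."""
    size = len(roi) // n_regions
    features = []
    for r in range(n_regions):
        features.extend(moments(roi[r * size:(r + 1) * size]))
    return features


# Hypothetical eight-sample region of interest split into two subregions.
example = fingerprint([1.0, 2.0, 3.0, 4.0, 2.0, 2.0, 4.0, 4.0], n_regions=2)
```

In the process described above, such a feature vector would be formed for each of the amplitude, phase, and frequency responses and the results concatenated.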

6.1  Introduction 

First, Section 6.2 will examine data analysis methods and possible underlying
reasons for the dominance of phase features in RF-DNA Dimensional Reduction
Analysis (DRA). Then, Section 6.3 will consider extensions of Simulation, an
operations research tool for examining steady state conditions from a time sample
application [136], to RF-DNA.

6.2  Normalization,  Standardization  and  Phase  Feature  Dominance 

Prior works, such as [113], have concluded that phase features were significantly
more useful for classification and verification than either amplitude or frequency features.
However, no reasons for this observation have been determined. Three possible reasons
for this result are hereby posited: 1) the mean centering and maximum scaled
normalization in [19] produces this as an artifact, 2) the signal modulation method, e.g.
ZigBee is Phase modulated as described in Section 2.2.1, is reflected in this result, and 3)
intrinsic qualities of amplitude, phase, and frequency responses are being represented. Of
interest here is considering 1) and 3), since 2) requires collecting signals from a wide
variety of devices.

6.2.1  Phenomenology  of  Amplitude,  Frequency,  and  Phase 

Conclusive  reasons  for  the  dominance  of  phase  features  in  RF-DNA  research  do 
not  exist;  however,  various  potential  reasons  do  exist  and  are  related  to  the 
phenomenology  of  amplitude,  frequency,  and  phase.  Amplitude,  frequency,  and  phase 
are  related  quantities  that  can  describe  a  signal.  All  three  quantities  are  inter-related  via 
the  expressions  described  in  (2.2)-(2.4)  and  [64,  191,  192,  527].  In  computation  for  a 
real-valued  signal,  instantaneous  amplitude  is  computed  as  the  magnitude  at  a  given  point 
in  time,  instantaneous  phase  is  then  computed  as  the  angle  of  the  signal’s  Hilbert 
transform, and finally, instantaneous frequency is computed as the gradient of
instantaneous  phase  [64,  191,  192,  527]. 
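
The three instantaneous quantities can be illustrated with a short sketch. It assumes the common analytic-signal construction (original signal plus j times its Hilbert transform), computed here with a direct DFT for self-containment, and approximates the phase gradient by a first difference; expressions (2.2)-(2.4) are not reproduced in this section, so the exact definitions used in the RF-DNA process may differ. All function names are hypothetical.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal via a direct DFT: double positive bins, zero negative bins."""
    n = len(x)
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    for k in range(n):
        if 0 < k < n / 2:
            spectrum[k] *= 2.0
        elif k > n / 2:
            spectrum[k] = 0.0
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def instantaneous_features(x, sample_rate):
    """Instantaneous amplitude, unwrapped phase (rad), and frequency (Hz)."""
    z = analytic_signal(x)
    amplitude = [abs(v) for v in z]
    phase = [cmath.phase(z[0])]
    for v in z[1:]:
        delta = cmath.phase(v) - phase[-1]
        delta -= 2.0 * math.pi * round(delta / (2.0 * math.pi))  # unwrap
        phase.append(phase[-1] + delta)
    frequency = [(b - a) * sample_rate / (2.0 * math.pi)
                 for a, b in zip(phase, phase[1:])]
    return amplitude, phase, frequency


# Hypothetical test tone: 5 Hz cosine sampled at 64 Hz for one second.
n_samples, fs = 64, 64.0
tone = [math.cos(2.0 * math.pi * 5.0 * t / fs) for t in range(n_samples)]
amp, ph, freq = instantaneous_features(tone, fs)
```

For this constant-envelope tone the instantaneous amplitude is flat and the instantaneous frequency is constant, mirroring the observation that an ideal PM signal carries little information in its amplitude.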

While  environmental  characteristics  may  be  captured  in  all  three  measurements 
[528],  they  are  more  pronounced  in  amplitude,  e.g.  amplitude  modulated  (AM)  radio 
signals  are  more  susceptible  to  storm  disturbances  than  frequency  modulated  (FM)  radio 
signals  [529].  The  ZigBee  and  Z-Wave  devices  of  interest  herein  are  Phase  Modulated 
(PM)  signals;  PM  signals  are  designed  so  that  amplitude  variations  are  small  with  ideally 
constant  amplitude  [527].  Additionally,  in  RF-DNA  research  relatively  narrow  frequency 
regions  are  generally  isolated  through  filtering  such  that  the  signal  itself  may  not  vary 
much  in  frequency.  Additional  reasons  for  phase  features  being  most  significant  could 
include  phase  noise  due  to  production  variations  [530]  and  that  phase  variations  have  a 
more  irregular  pattern,  short  settling  duration  and  a  smaller  dynamic  range  [112]. 
Therefore,  it  seems  reasonable  that  phase  features  dominate,  and  especially  for  the  PM 
signals. 

6.2.2  Normalization  and  Standardization 

When one examines a boxplot of the ZigBee features, Figure VI-1, it is seen that
phase, amplitude and frequency features have different distributions. Boxplots are akin to
plotting a histogram in condensed form [531, 532], thus permitting the distribution of
multiple features to be evaluated side-by-side. The boxplot format presented in Figure
VI-1 employs a “compact” format with a black dot indicating the median, thick blue lines
to show the range from the 25th to 75th percentiles, thin blue lines to encompass all other
non-outlier data, and small blue circles to indicate outliers [533]. Figure VI-1 shows that
the distribution and medians of phase features are more constrained than amplitude or
frequency features. Additionally, phase features appear to have fewer outliers.


[Figure: boxplot; horizontal axis: Feature Number (0 to 750).]

Figure VI-1: Boxplot of ZigBee RF-DNA Features at SNR = 10 dB for Authorized
Devices Using the Nominal Mean Centering and Maximum Scaled Normalization
Process of [18, 19].

Due to characteristics of PM signals, any data normalization process could further
impact feature relevance. The nominal RF-DNA Fingerprinting process incorporates a
mean centering and maximum scaled normalization approach seen in (2.5) of Section 2.4.
While mean centering and maximum scaled normalization does not appear in reviews of
normalization methods, e.g. [534], this approach is consistent with various applications,
c.f. [535-541]. However, the reason for using this approach is rarely provided; one
exception is Cobb et al. [18], who indicated that this normalization approach was used to
account for any “uncontrolled power variation.”

Classifiers and neural network approaches frequently work best with input data
normalized by some means [542]; however, it is very common to employ standard score
normalization (standardization), c.f. [330, 534, 543]. The boxplots in Figure VI-1
display that the data has different ranges for amplitude, frequency and phase, and hence
examining any issue with the mean centering and maximum scaled normalization
approach is important.

To examine the effect of normalization on RF-DNA, a revised RF-DNA
normalization was therefore applied in the form of

\[ \bar{g}_c[n] = \frac{g[n] - \mu_g}{\operatorname{std}(g_c[n])} \tag{6.1} \]

where g in (6.1) represents the signal of interest, per the respective RF-DNA fingerprint
elements in (2.2)-(2.4) for n = 1, 2, ..., N_s, where N_s represents the number of samples in
the region of interest (ROI), g_c[n] = g[n] - μ_g is the mean-centered signal, and μ_g
represents the mean of the g-th fingerprint element.
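
The two normalizations being compared can be sketched side by side. The mean centering and maximum scaling function follows the verbal description of (2.5), which is not reproduced in this section, and the use of the population standard deviation in the standardization is likewise an assumption; both function names are hypothetical.

```python
from statistics import fmean, pstdev


def center_and_max_scale(g):
    """Nominal approach: subtract the mean, then divide by the peak magnitude."""
    mu = fmean(g)
    centered = [v - mu for v in g]
    peak = max(abs(v) for v in centered)
    return [v / peak for v in centered]


def standardize(g):
    """Standard score normalization in the form of (6.1)."""
    mu = fmean(g)
    sigma = pstdev(g)  # assumed population form; a sample form differs slightly
    return [(v - mu) / sigma for v in g]


# Hypothetical signal samples.
signal = [1.0, 2.0, 3.0, 6.0]
scaled = center_and_max_scale(signal)
zscores = standardize(signal)
```

Maximum scaling bounds every sample to [-1, 1] regardless of spread, whereas standardization fixes the spread to unit standard deviation and leaves the extremes unbounded.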

After  standardizing  the  data,  the  RF-DNA  fingerprinting  process  was  followed, 
otherwise  unaltered,  and  the  resultant  standardized  RF-DNA  features  are  presented  in 
Figure  VI-2.  The  data  means  now  appear  more  centered  and  the  ranges  of  the 
distributions  of  amplitude  and  frequency  appear  more  constrained. 

Figure VI-2: Boxplots of ZigBee RF-DNA Features at 10 dB for Authorized Devices
Using Standardized Data.

When using standardized RF-DNA features with MDA/ML processing, negligible
gains (G), defined as the reduction in required SNR (in dB) to achieve a given %C, of
G = 0.09 dB (TNG) and G = 0.06 dB (TST) are realized at %C = 90% when compared
with nominal centered and maximum-scaled RF-DNA features. Thus, there is effectively
no difference between performance outputs, and the slight differences are logically
assignable to differing random values used in the Additive White Gaussian Noise
(AWGN) process.
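
One way to compute such a gain from two accuracy-versus-SNR curves is sketched below. Linear interpolation between measured SNR points is an assumption, the function name is hypothetical, and the curve values are illustrative rather than results from this work.

```python
def snr_at_accuracy(snr_db, pct_correct, target=90.0):
    """Linearly interpolate the SNR (dB) at which accuracy first reaches target."""
    points = list(zip(snr_db, pct_correct))
    for (s0, c0), (s1, c1) in zip(points, points[1:]):
        if c0 < target <= c1:
            return s0 + (target - c0) * (s1 - s0) / (c1 - c0)
    raise ValueError("target accuracy is not crossed on this curve")

# Hypothetical curves (not measured data).
snr = [0.0, 4.0, 8.0, 12.0, 16.0]
nominal_pc = [55.0, 70.0, 85.0, 95.0, 99.0]
candidate_pc = [57.0, 72.0, 87.0, 97.0, 99.5]

# Positive gain means the candidate reaches %C = 90% at a lower SNR.
gain_db = snr_at_accuracy(snr, nominal_pc) - snr_at_accuracy(snr, candidate_pc)
```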

While this shows a negligible impact of standardization on classification
performance, the normalization and standardization method may still be important. It is
possible that combining DRA with standardization could lend itself to improved performance.
Therefore, DRA using a low number of features was pursued; Unscaled Summed MDA
Loadings Fusion (USum MLF), Section 4.2.3.1, was used to select 10 features. With the
top 10 features, classification accuracy does not achieve %C = 90%; however, one can
potentially determine features very useful for discrimination, as discussed in Section
IV. With Ndra = 10, classification accuracy was evaluated using MDA/ML models, with
Relative Accuracy Percentage (RAP) values computed with respect to the nominal
TST MDA/ML model. Between the nominal and standardized approaches,
RAPtng = 0.970 (TNG) and RAPtst = 0.968 (TST) were computed, indicating that the
nominal approach offers higher accuracy. Therefore, empirically, the nominal mean-
centered and max-scaled RF-DNA normalization has a small, but distinct, advantage over
standardization.

6.3  Simulation  Methods,  Dependence  and  Correlation  Effects  in  RF-DNA 

Simulation  is  a  tool  used  by  operations  research  professionals  to  model  and 
understand  complex  processes  [136].  One  area  of  interest  in  simulation  research  is 
examining  steady  state  conditions  from  a  time  sampled  output.  One  commonly  then 
divides  a  steady  state  signal  into  independent  and  uncorrelated  batches.  The  batches  are 
then  examined  to  provide  insight  into  how  a  given  system  functions.  Particular  emphasis 
will  be  given  towards  signal  autocorrelation  to  determine  batch  sizes,  data 
standardization,  and  batching  means  methods  that  leverage  knowledge  of  the  signal  itself 
and the binning process. Simulation studies involve collecting input and output data, and
parameters to create a statistical model of a real or hypothetical system under analysis
[136].  Simulation  research  is  prominent  in  engineering,  business  and  operations  research 
applications  (cf.  [136,  243,  544-548]).  Simulations  can  involve  multiple  short  sets  of 
system  output  data  or  one  long-run  of  system  output  data.  When  one  long  set  of 
observation  data  is  available  and  it  is  prohibitive  to  collect  additional  data,  batches  are 
one  approach  used  to  provide  additional  data  about  steady  state  condition  [136].  Batches 
are  constructed  as  visualized  in  Figure  VI-3  as  M  subregions  of  the  sample,  with  each 
region  considered  as  a  separate  observation  and  containing  Ns  samples  per  subregion 
[243].  The  0th  batch  in  Figure  VI-3  is  considered  a  transient  region  and  is  not  used  for 
analysis.  To  ensure  that  each  batch  can  be  considered  as  a  separate  observation  of  the 
system  in  steady  state,  understanding  the  independence  and  correlation  of  batches  is 
needed  [136].  Additionally,  when  analyzing  simulation  data,  one  first  needs  to  identify 
the  point  where  the  system  reaches  steady  state  and  is  not  influenced  by  startup 
characteristics  [136].  In  other  business  analytics  domains,  similar  approaches  to  batching 
are  termed  binning,  c.f.  [199,  549]. 
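
The batching idea described above can be sketched in a few lines. This is a simplified illustration: the function name and the sample values are hypothetical, and the transient length is assumed to be known in advance.

```python
import statistics

def batch_means(series, n_transient, batch_size):
    """Drop the transient (batch 0), split the rest into equal batches, return each mean."""
    steady = series[n_transient:]
    n_batches = len(steady) // batch_size  # any incomplete trailing batch is ignored
    return [
        statistics.fmean(steady[i * batch_size:(i + 1) * batch_size])
        for i in range(n_batches)
    ]

# Hypothetical output: two transient samples followed by eight steady-state samples.
series = [9.0, 7.0] + [1.0, 3.0, 2.0, 2.0, 0.0, 4.0, 3.0, 1.0]
means = batch_means(series, n_transient=2, batch_size=4)
```

Each returned mean is then treated as a separate observation of the steady-state system, which is only reasonable when the batches are long enough to be approximately uncorrelated.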

Figure VI-3: General Batching Method for Simulation Output Showing the
Response Divided into M Total Batches [243].


The batching process in simulation closely parallels the RF-DNA
Fingerprinting process, as described in Section 2.4, in that a signal’s Region of Interest
(ROI) is divided into Nr equally sized subregions which are then processed for further
analysis. Since the RF-DNA process yields distinction between devices, it is logical that
RF-DNA fingerprints computed from independent measurements will be useful for
device discrimination. Therefore, methods from simulation aimed at reducing correlation
effects in the data could be beneficial to RF-DNA.


6.3.1  Transient  Determination 

Transient periods are present as a system begins to operate; in Figure VI-3 the
transient period is batch 0. Transients (considered as startup biases) are detrimental to
simulation studies [550]; thus, simulation studies generally consider only steady-state
processes in order to accurately model a process by reducing the influence of startup
characteristics [136]. Automated transient detection methods were proposed by [551].
In its objective, transient determination is similar to the discarded initialization region
in RF-DNA. While automated approaches for transient determination could be applied in
the RF-DNA Fingerprinting process, such approaches are not considered herein since the
ROI is device dependent. However, future work may wish to examine this area in
conjunction with leveraging knowledge about the communication signal itself to
determine and isolate the ROI for RF-DNA.

6.3.2  Autocorrelation  and  the  Number  of  Batches 

Batch size is another important question in simulation analysis [552-554].
Additionally, higher order moments (such as 3rd and 4th order) were determined by [555]
to be more sensitive to interval differences than lower order moments. Therefore,
selecting appropriately sized ROIs may be critical to RF-DNA device classification and
device ID verification performance. In simulation studies, normality of a given batch
can be one factor used to determine batch size [556]. Various approaches (in multiple
disciplines) exist, e.g. [199, 549, 557-562], for determining batch size. Determining the
appropriate number of batches to create minimizes correlation between batches, see
[243, 551, 554], and it is of interest to produce independent batches.

Although inter-feature correlation can be beneficial to classification performance,
intra-feature correlation (correlation between data features) generally causes adverse
effects to classification performance [563-565]. The reasoning for this is that highly
correlated features are redundant [566-568]. In other words, if corr(X,Y) = 1,
then corr(X,Y) = corr(X, aX) = corr(X,X), indicating that no information was added
by retaining both features. Multiple correlated features can also cause instability issues in
linear methods such as ANOVA, logistic regression, linear least squares regression, and
discriminant analysis [564, 568]. While nonlinear classifiers can process correlated data,
e.g. [569], redundant features will still increase computation time and are undesirable
[567].

The covariance between two variables X and Y is defined as

\[ \operatorname{Cov}(X,Y) = \operatorname{Cov}(Y,X) = E[(X - E(X))(Y - E(Y))], \tag{6.2} \]

with the correlation of X and Y being the scaled covariance,

\[ \operatorname{Corr}(X,Y) = \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)}\sqrt{\operatorname{Var}(Y)}} \tag{6.3} \]

which normalizes the covariance to have values between -1 and +1 [551].
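
The definitions in (6.2) and (6.3), and the redundancy of a scaled feature, can be checked numerically with a short sketch. The function names and data are illustrative assumptions; population (1/N) moments are used, and the scale factor is assumed positive.

```python
import math
import statistics

def covariance(x, y):
    """Population covariance: E[(X - E(X))(Y - E(Y))]."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return statistics.fmean([(a - mx) * (b - my) for a, b in zip(x, y)])

def correlation(x, y):
    """Covariance scaled by both standard deviations; Var(X) = Cov(X, X)."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical feature and a positively scaled copy of it.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
scaled = [3.0 * v for v in x]
r_scaled = correlation(x, scaled)  # approximately 1: the copy adds no information
```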

To consider batch means and autocorrelation computations, we consider a
generic steady-state sequence vector V_n for n = 1, 2, ..., N [243, 551], where N is the total
number of samples. For this sequence vector, we compute the steady-state mean as

\[ E[V_n] = \mu \tag{6.4} \]

and variance as

\[ E[(V_n - \mu)^2] = \sigma^2 . \tag{6.5} \]

The autocorrelation function for a sequence vector is a covariance function with
properties,

\[ \gamma(0) = \sigma^2, \qquad \gamma(K) = \gamma(-K) \tag{6.6} \]

where K is an offset [551]. Of interest is determining the spacing within a sequence to
find the covariance stationary quantity

\[ \operatorname{Cov}(V_n, V_{n+K}) = \gamma(K), \tag{6.7} \]

for any n and K [551]. With these quantities, dependence can be computed via the
correlation, where (6.3) is computed for,

\[ \rho(K) = \operatorname{Corr}(V_n, V_{n+K}) = \frac{\operatorname{Cov}(V_n, V_{n+K})}{\sqrt{\operatorname{Var}(V_n)}\sqrt{\operatorname{Var}(V_{n+K})}} = \frac{\gamma(K)}{\gamma(0)} \tag{6.8} \]

which is the correlation within the sequence with a separation of K [551]. Correlation has
various interesting and useful properties,

\[ \rho(0) = 1, \qquad \rho(K) = \rho(-K), \qquad -1 \le \rho(K) \le 1 . \tag{6.9} \]
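
A sample estimate of the lag-K autocorrelation can be sketched as follows. The biased estimator (dividing both sums by N) is an assumption chosen because it keeps the estimate within [-1, 1]; the function name and the test sequence are illustrative.

```python
import statistics

def autocorrelation(v, lag):
    """Sample estimate of rho(K) = gamma(K) / gamma(0) for a sequence v."""
    n = len(v)
    mu = statistics.fmean(v)
    k = abs(lag)  # rho(K) = rho(-K)
    gamma_0 = sum((a - mu) ** 2 for a in v) / n
    gamma_k = sum((v[i] - mu) * (v[i + k] - mu) for i in range(n - k)) / n
    return gamma_k / gamma_0

# Hypothetical period-2 sequence: strongly anti-correlated at odd lags.
v = [1.0, -1.0] * 8
rho_0 = autocorrelation(v, 0)
rho_1 = autocorrelation(v, 1)
rho_2 = autocorrelation(v, 2)
```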

In RF-DNA fingerprinting, one extends this process by realizing that K is
equivalent to Ns, the total number of samples in a subregion. Since Nr, the total number
of subregions, is frequently empirically determined, Ns is also empirically determined in
prior RF-DNA work, see [18, 59, 89]. However, autocorrelation could assist in this
process by determining the number of time samples-per-subregion that leads to
uncorrelated subregions. When computing the autocorrelation function for multiple
devices, one aims to find the number of time samples-per-subregion associated with the
smallest autocorrelation. For multiple devices, one should compare the autocorrelation
functions of all devices simultaneously, with the best minimum autocorrelation across
devices used to determine ROI size.

The autocorrelation amplitude for the 4 authorized ZigBee devices is presented in
Figure VI-4 at SNR = 10 dB, along with a line of 0 autocorrelation. Of interest in Figure
VI-4 is where the autocorrelation functions cross the 0 autocorrelation line, which
indicates minimum autocorrelation. Figure VI-4 shows that minimum autocorrelation
(approximately 0) for the four devices occurs at autocorrelation indices of 24 time
samples-per-subregion and 48 time samples-per-subregion. Incidentally, 48 time
samples-per-subregion and 24 time samples-per-subregion correspond, respectively, with
1 subregion-per-bit and 2 subregions-per-bit as explored by Dubendorfer [91]. While
Dubendorfer [91] employed a physical understanding of signal structure and findings of
prior empirical work to determine subregion size, employing autocorrelation for ROI size
determination adds robustness to this decision.
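
Selecting a common subregion size from several devices' autocorrelation functions could be sketched as below. Reading "best minimum autocorrelation across devices" as a minimax rule is an assumption, and the function name, device labels, and autocorrelation values are hypothetical rather than taken from Figure VI-4.

```python
def best_common_lag(acf_by_device, lags):
    """Return the lag whose worst-case |autocorrelation| across devices is smallest."""
    def worst_case(lag):
        return max(abs(acf[lag]) for acf in acf_by_device.values())
    return min(lags, key=worst_case)

# Hypothetical autocorrelation values (not measured data), indexed by lag in samples.
acf_by_device = {
    "dev1": {12: 0.40, 24: 0.02, 36: -0.30},
    "dev2": {12: 0.35, 24: -0.01, 36: -0.25},
    "dev3": {12: 0.45, 24: 0.03, 36: -0.35},
}
chosen = best_common_lag(acf_by_device, lags=[12, 24, 36])
```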


Figure  VI-4:  Autocorrelation  of  ZigBee  Data  Features,  SNR  =  10  dB. 

VII. Summary and Conclusions


What  means  all  this? 

-Marcus  Aurelius,  121-180 

This  document  presents  various  theoretical,  practical,  and  application-based 
contributions  made  in  the  Radio  Frequency  (RF)  Fingerprinting  arena,  including 
advancements  in  classifier  model  development,  Dimensional  Reduction  Analysis  (DRA), 
and  AFIT’s  RF  Distinct  Native  Attribute  (RF-DNA)  Fingerprinting  process.  This  chapter 
presents  a  summary  of  the  research,  its  contributions  and  recommendations  for  future 
research. 

7.1  Research  Summary 

Simple, low-cost wireless devices permeate the world, including those used in
Critical Infrastructure (CI) applications where they interact with physical devices.
ZigBee and Z-Wave devices are two such device types; they have well-known security
issues, cf. [37, 38, 170], and are of interest for this research. When considering security
and a hierarchy of communication signaling, such as the seven layer Open System
Interconnection (OSI) model [62-64], security is generally only considered within the
Application, Network and Data Link layers [51-58]. Much less emphasis has been
placed on Physical (PHY) layer security, the interface layer of signals emanating from the
device itself, and extensions of the PHY-based RF-DNA Fingerprinting process are of
interest for improving security.

RF-DNA Fingerprinting aims to exploit device emissions in a biometric-like
manner where statistical features having attributes of universality, distinctiveness,
permanence, and collectability are generated and used for Device Classification and
Device ID Verification [19, 66]. RF-DNA fingerprints are statistical in nature and
involve computing the variance, skewness and kurtosis within Regions of Interest (ROI)
selected from instantaneous amplitude, phase, and frequency responses. When
considering RF-DNA fingerprints, one must develop a classifier model to discriminate
between devices. Previous efforts have introduced and employed Multiple Discriminant
Analysis (MDA), Generalized Relevance Learning Vector Quantization Improved
(GRLVQI), Random Forests, and Learning From Signals (LFS) [51, 90, 133, 134]
processes for classification. Herein, the MDA and GRLVQI processes are considered
and extended. Additionally, RF-DNA features are frequently numerous and thus DRA is
of interest to select appropriate subsets of features. Prior DRA research in RF-DNA has
considered the two-sample Kolmogorov-Smirnov (KS) test and GRLVQI relevance
ranking values. Herein, multiple extensions to DRA were made to introduce new
methods, develop an MDA-based DRA method, and improve the understanding of DRA
methods.

Deficiencies in p-value based DRA were illustrated, and the proposed F-test and
revised KS-test illustrated advantages in using test statistic values for DRA. Further
improvements in DRA included developing quantitative dimensionality assessment, which
was shown to remove subjectivity when selecting DRA subsets. MDA-based Loadings
Fusion (MLF) was shown to be an MDA-classifier based DRA method which resolved
previously mentioned deficiencies in MDA [51, 91, 92, 113, 134, 241]. The proposed
autocorrelation-based approach to RF-DNA fingerprint subregion size specification was
shown to add robustness to the previously subjective RF-DNA fingerprinting subregion
specifications.

The proposed F-test and MLF DRA methods were shown to offer distinct
performance improvements over the KS-test and GRLVQI DRA methods. ZigBee
Device Classification results for selected DRA methods with an MDA/ML classifier and
an arbitrary average correct classification (%C) benchmark of %C = 90% included SNR
gain (Gsnr) relative to the benchmark GRLVQI DRA with Ndra = 50 feature sets of
1) Gsnr = +0.82 dB for SSum MLF DRA, and 2) Gsnr = +0.10 dB for F-test DRA using
Ndra = 50, compared to 3) Gsnr = +0.71 dB for KS-Test DRA using Ndra = 50, and
4) Gsnr = -4.22 dB for the baseline Random DRA using Ndra = 50. ZigBee Device ID
Verification results, using the same Ndra = 50 feature sets and MDA/ML classifier,
included correct verification of authorized device IDs (%VA) and correct detection of
unauthorized rogue device IDs (%VR) of %VA = 50% and %VR = 91.67% for the
benchmark GRLVQI DRA, with 1) %VA = 50% and %VR = 91.67% for SSum MLF DRA,
and 2) %VA = 75% and %VR = 91.67% for F-test DRA, compared to 3) %VA = 50% and
%VR = 86.11% for the KS-test, and 4) %VA = 50% and %VR = 75% for the baseline
Random DRA. Thus the proposed SSum MLF DRA and F-Test DRA offer a
performance advantage over both GRLVQI DRA and KS-Test DRA while being
computationally and conceptually simpler DRA methods.

The optimized GRLVQI algorithm and the proposed GRLVQI-D algorithm
showed improved performance over the baseline GRLVQI algorithm. When considering
GRLVQI classifier improvements using Nf = 189 Z-Wave features and the %C = 90%
benchmark, demonstrated Device Classification performance relative to baseline
GRLVQI using a squared-Euclidean distance measure includes 1) improved
Gsnr = +1.94 dB using the optimized GRLVQI algorithm, and 2) improved Gsnr = +1.84
dB using GRLVQI-D with a Cosine distance measure. For Z-Wave Device ID
Verification, results include 1) worst case %VA = 33.33% for baseline GRLVQI,
2) improved %VA = 66.66% for GRLVQI-D using a Cosine distance measure, and 3) best
case %VA = 100% using the optimized GRLVQI algorithm. Due to availability, Z-Wave
devices were not present for rogue device assessments. When ZigBee RF-DNA
fingerprints were considered using the Z-Wave optimized GRLVQI and GRLVQI-D
algorithms, performance was worse than the nominal settings of Reising [51], indicating
that the Z-Wave optimal settings are not applicable to ZigBee device discrimination.
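
For reference, the two distance measures named above can be sketched as follows. The cosine distance is written in its common form (one minus cosine similarity); whether GRLVQI-D uses exactly this form is an assumption, and the function names and vectors are illustrative.

```python
import math

def squared_euclidean(x, w):
    """Squared Euclidean distance between an exemplar x and a prototype w."""
    return sum((a - b) ** 2 for a, b in zip(x, w))

def cosine_distance(x, w):
    """One minus the cosine of the angle between x and w (assumed form)."""
    dot = sum(a * b for a, b in zip(x, w))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in w))
    return 1.0 - dot / norm

# Hypothetical vectors: orthogonal directions, different magnitudes.
x = [1.0, 0.0]
w = [0.0, 2.0]
d_euclid = squared_euclidean(x, w)
d_cosine = cosine_distance(x, w)
d_cosine_parallel = cosine_distance(x, [3.0, 0.0])  # same direction, larger magnitude
```

The cosine measure ignores magnitude and responds only to direction, which is one reason the two measures can rank prototypes differently.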

7.2  Research  Contributions 

Three  primary  contributions  were  made  under  this  research,  including 
improvements  to  1)  the  Dimensional  Reduction  Analysis  (DRA)  methodology,  2)  the 
GRLVQI  classifier,  and  3)  the  RF-DNA  Fingerprinting  process.  A  summary  of  each 
follows: 

1. DRA Improvements: Includes development and analysis of MDA
Loadings Fusion (MLF) methods to rectify the issue reported in, cf.
[51, 91, 92, 113, 134, 241], that MDA lacks a classifier-based
relevance. An F-test DRA method was introduced and shown to offer
reasonable performance. Quantitative DRA assessment methods were
developed to determine the number of retained features (Ndra), and their
performance was compared with the previous qualitative DRA methods of [91].
Prior RF-DNA DRA efforts have considered p-values for feature
relevance ranking [89, 113]. However, phenomenological issues exist
with such an approach; an improved understanding is developed herein
based on the merits of p-values versus test statistics for feature relevance
ranking. Finally, a preliminary investigation into DRA relevance fusion
was presented.

2. GRLVQI Classifier Improvements: Involved changing the
underlying distance measure. To do so, one must necessarily change the
cost function and derivatives of the GRLVQI algorithm. Since a)
GRLVQI is a rather complicated algorithm and b) many different distance
measures exist, a procedure to select different distance measures was
created that involved first comparing distance measures themselves and
then iteratively incorporating a distance measure into successively more
complicated learning vector quantization (LVQ) algorithms leading up to
GRLVQI. For this process, the first known derivative framework for the
LVQ-family of algorithms was developed. Subsequently, an optimization
approach was presented to determine reasonable algorithm parameter
settings for the baseline GRLVQI process and the newly developed
distance-based GRLVQI process (GRLVQI-D).

3. RF-DNA Fingerprinting Improvements: An enhanced
understanding of the nature of instantaneous amplitude, phase, and
frequency  features  was  developed  to  better  understand  why  phase  features 
have  historically  been  the  most  relevant  for  device  classification.  An 
autocorrelation  method  was  developed  and  characterized  to  automate  the 
determination  of  the  number  of  subregions  used  within  a  given  response 
ROI.  Finally,  a  first-look  assessment  of  simulation-based  ROI  weighting 
schemes  was  completed  for  RF-DNA  Fingerprinting. 

7.3  Proposed  Future  Research 

Given  the  methods  developed  under  this  research  and  corresponding  findings, 
many  different  future  research  endeavors  could  be  pursued.  The  following  are  proposed: 

1.  Additional  GRLVQI  Algorithm  Extension:  Herein,  distance  measures 
and  the  relative  distance  difference  equation  were  changed  in  the  GRLVQI 
algorithm.  However,  future  work  could  consider  different  activation 
functions,  e.g.  [570],  to  replace  the  sigmoid  operation  in  GRLVQI.  The 
presented  LVQ-family  derivative  skeleton  would  be  an  initial  starting 
point  in  this  effort. 

2. Tailor Algorithmic Optimization to the Signal of Interest: Optimizing
the GRLVQI algorithmic settings was considered for Z-Wave data and
shown to be viable. When these settings were applied to the ZigBee
dataset,  performance  was  degraded  relative  to  the  baseline.  To  compute 
optimal  settings  for  the  ZigBee  dataset,  one  would  require  many 
algorithmic  runs  which  would  be  computationally  costly.  To  facilitate 
large-scale  algorithm  optimization  studies,  employing  the  Air  Force 
Research  Laboratory  DOD  Supercomputing  Resource  Center  (DSRC) 
should  be  considered.  Employing  DSRC  would  facilitate  tailored 
GRLVQI  settings  to  a  given  signal  of  interest,  in  addition  to  permitting 
comparing  different  optimization  methods. 

3.  Extend  DRA  Methods:  Herein,  two  additional  DRA  methods  (F-test  and 
MDA  Loadings)  were  introduced  for  RF-DNA  Fingerprinting 
applications.  Additional  DRA  methods  are  identified  in  literature  and 
could  be  considered,  including  entropy  [76],  Best  Individual  Features 
[213,  571],  Logistic  Principal  Component  Analysis  (PCA)  [572], 
nonlinear  PCA  [213],  kernel  PCA  [213],  and  Independent  Component 
Analysis  (ICA)  [213,  573]. 

4. Revisit DRA Fusion: The DRA fusion methods considered herein
demonstrated some utility at lower Ndra values. This could be explored
further to identify alternate DRA fusion schemes.

5.  Further  Consider  Simulation  Methods:  Autocorrelation  methods  from 
Simulation  were  shown  to  be  applicable  to  RF-DNA  Fingerprinting. 
Additional Simulation methods that consider weighting distributions to
reduce correlation effects, e.g. [136, 544-546, 574-578], could be
developed and applied to Region of Interest (ROI) subregions.

6. Explore RF-DNA Feature Phenomenology: It was seen that
instantaneous phase features are generally more relevant than either
amplitude or frequency features, and some insight was developed to
address this. However, to better understand the relationship between
feature types and their relevance to the classification decision, additional
studies could be performed. In this case, one could consider simulated
devices (agnostic of modulation) and similar devices that differ only by
the modulation they employ.

APPENDIX A: Lemma Associated with Multiple Discriminant Analysis Loadings


Learning is essentially hard; it happens best when one is deeply engaged in hard
and challenging activities.

-Seymour Papert, 1928 -

Lemma 1: If a is a scalar, b is a vector, and X is a matrix, then when computing
the correlation of ab^T X and X, corr(X, ab^T X) = corr(X, b^T X).

To prove Lemma 1, the scaling will be represented as eigenvectors

\[ b^{*} = ab, \tag{A.1} \]

scaled by a scalar a [237]. If the projection matrix were scaled, as in (A.1), then the
relationship in (3.11) could be expressed as

\[ \operatorname{corr}(X, b^{*T}X) = \operatorname{corr}(X,X)\, D^{1/2}\, b^{*} \left[ b^{*T} \operatorname{cov}(X,X)\, b^{*} \right]^{-1/2}, \tag{A.2} \]

which expands to

\[ \operatorname{corr}(X, ab^{T}X) = \operatorname{corr}(X,X)\, D^{1/2}\, ab \left[ ab^{T} \operatorname{cov}(X,X)\, ab \right]^{-1/2} . \tag{A.3} \]

Equation (A.3) can be expanded to

\[ \operatorname{corr}(X,X)\, D^{1/2}\, a b\, a^{-1} \left[ b^{T} \operatorname{cov}(X,X)\, b \right]^{-1/2} = \operatorname{corr}(X, ab^{T}X), \tag{A.4} \]

which means the scaling multiplier can cancel, yielding the conclusion that scaling the
eigenvectors does not change the loadings,

\[ \operatorname{corr}(X, ab^{T}X) = \operatorname{corr}(X, b^{T}X). \tag{A.5} \]
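
The lemma can be checked numerically with a small sketch. The data, weights, and function names below are hypothetical, and the scalar a is assumed positive so that a multiplied by the inverse square root of a squared cancels to one.

```python
import math
import statistics

def correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    var_x = sum((p - mx) ** 2 for p in x)
    var_y = sum((q - my) ** 2 for q in y)
    return cov / math.sqrt(var_x * var_y)

def project(weights, features):
    """Compute b^T X: one projected score per observation (column of X)."""
    n_obs = len(features[0])
    return [sum(w * row[i] for w, row in zip(weights, features)) for i in range(n_obs)]

# Hypothetical data: two features (rows) observed four times (columns).
features = [[1.0, 2.0, 4.0, 7.0], [3.0, 1.0, 4.0, 1.0]]
b = [0.5, -2.0]
a = 7.0  # assumption: positive scale factor

loadings = [correlation(row, project(b, features)) for row in features]
scaled_loadings = [
    correlation(row, project([a * w for w in b], features)) for row in features
]
```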

APPENDIX B: Examination of LVQ and GLVQ Properties and Features


You  can’t  process  me  with  a  normal  brain... 

-Carlos  Estevez,  1965  - 


The GLVQ, GRLVQ, and GRLVQI relative distance measure in (3.34) deserves
some explanation of what it actually measures. A simple example illustrates this.
Consider a hypothetical space presented in Figure B-1 where there are two hypothetical
PVs, placed at (-1, 1) and (-1, -1) respectively, and an exemplar at (1, 1). The squared
Euclidean distances between the exemplar and each PV are respectively

\[ d_{PV_1} = 4 \tag{B.1} \]

and

\[ d_{PV_2} = 8 . \tag{B.2} \]


[Plot: Exemplar, PV1, and PV2 markers; horizontal axis "X Location" from -1.5 to 1.5;
vertical axis from -1.5 to 1.5.]

Figure B-1: Hypothetical Situation with Two PVs and One Exemplar

To consider the output of the relative distance measure in (3.34), one can consider
two situations: 1) PV1 being the correct in-class PV or 2) PV2 being the correct in-class
PV. For case 1), the relative distance difference measure returns a score of -0.3333, but
in case 2) it returns a score of 0.3333. Per the discussion in Section 3.3.1.6 on
interpreting the distance difference measure, negative values are indicative of correct
classification and positive values of incorrect classification, with the magnitude
indicating how “correct” or “incorrect.”
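
The two scores quoted above can be reproduced with a short sketch. The relative distance difference is assumed here to take the usual GLVQ form, (d_correct - d_incorrect) / (d_correct + d_incorrect); equation (3.34) itself is not restated in this appendix, and the function names are hypothetical.

```python
def squared_euclidean(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def relative_distance_difference(d_correct, d_incorrect):
    """Assumed GLVQ form: negative when the correct PV is the closer one."""
    return (d_correct - d_incorrect) / (d_correct + d_incorrect)

exemplar = (1.0, 1.0)
pv1 = (-1.0, 1.0)
pv2 = (-1.0, -1.0)

d1 = squared_euclidean(exemplar, pv1)  # (B.1)
d2 = squared_euclidean(exemplar, pv2)  # (B.2)

case1 = relative_distance_difference(d1, d2)  # PV1 treated as the correct in-class PV
case2 = relative_distance_difference(d2, d1)  # PV2 treated as the correct in-class PV
```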

To extend this example of how the PVs, exemplar, distance measure, and relative
distance difference interact, one can compute the distance of every point to the two
stationary PVs. Figure B-2 presents the squared Euclidean distance between every point
(0.01 sampling) from -4 to 4 and the two PVs. Figure B-2a presents the
values where PV1 is considered, and Figure B-2b presents the values where PV2 is
considered. Logically, the distances form circles of increasing distance from the
respective PVs.


[Panels, each plotted against X Location: a) Distances with respect to PV1;
b) Distances with respect to PV2]

Figure B-2: Distances Between Exemplars and a) PV1 and b) PV2

Considering the relative distance difference metric for Figure B-2, and assuming
that PV1 is the correct classification, one sees Figure B-3. Here one can see that the scores
go to -1 as one approaches PV1 and +1 as one approaches PV2, with curves of different
values around each PV. As PV1 and PV2 move closer together, one finds that most
possible points for an exemplar are scored near 0, while only points extremely close to
each PV receive higher magnitude scores, as seen in Figure B-4.
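
The behavior described for Figures B-3 and B-4 can be explored with a coarse-grid sketch. Assumptions: the same GLVQ-style relative distance difference as above, a 0.5 grid spacing instead of the 0.01 sampling used for the figures, a hypothetical closely spaced PV pair at (-1, 0.1) and (-1, -0.1), and an arbitrary near-zero threshold of 0.1.

```python
def relative_score(point, pv_correct, pv_incorrect):
    """Relative distance difference using squared Euclidean distances."""
    d_c = sum((a - b) ** 2 for a, b in zip(point, pv_correct))
    d_i = sum((a - b) ** 2 for a, b in zip(point, pv_incorrect))
    return (d_c - d_i) / (d_c + d_i)

def fraction_near_zero(pv_correct, pv_incorrect, threshold=0.1):
    """Share of grid points on [-4, 4] x [-4, 4] (0.5 spacing) with |score| < threshold."""
    grid = [(0.5 * i, 0.5 * j) for i in range(-8, 9) for j in range(-8, 9)]
    hits = sum(abs(relative_score(p, pv_correct, pv_incorrect)) < threshold for p in grid)
    return hits / len(grid)

spaced = fraction_near_zero((-1.0, 1.0), (-1.0, -1.0))   # PVs as in Figure B-1
close = fraction_near_zero((-1.0, 0.1), (-1.0, -0.1))    # hypothetical close PVs
near_pv1 = relative_score((-1.0, 0.99), (-1.0, 1.0), (-1.0, -1.0))
```

With the closer pair, a much larger share of the grid scores near zero, while a point almost on top of the correct PV still scores close to -1.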


Figure B-3: General Relationship Between Distance Difference Measure and PV
Distances


Figure B-4: Relationship Between Distance Difference Measure and PV Distances
for Closely Spaced PV1 and PV2

APPENDIX C: P-values versus Test Statistics on Selected Academic Datasets


...the primary product of a research inquiry is one or more measures of effect
size, not p values.

-Jacob Cohen, 1923 - 1998

Section 4.2.1.3 showed that p-values were largely deficient as a feature relevance
ranking tool for RF-DNA due to p-values 1) being computed beyond machine precision,
2) having less resolution than test statistic values, 3) converging on zero, and 4) offering
slightly less classification performance than test statistic relevance ranking. However,
this was only a single example on a specific problem; therefore, this appendix presents
empirical demonstrations on academic datasets to show that this problem is not unique to
RF-DNA.
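
The machine-precision issue can be illustrated with a stand-in computation. This sketch uses a two-sided normal tail probability rather than the F-test or KS-test distributions used in this work, and the statistic values and function name are hypothetical; the point is only that distinct, well-separated test statistics can all map to a p-value of exactly zero in double precision.

```python
import math

def two_sided_normal_p(z):
    """Two-sided tail probability of a standard normal statistic (stand-in only)."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical test statistics for three features.
stats = [10.0, 40.0, 50.0]
pvals = [two_sided_normal_p(z) for z in stats]

# Ranking by statistic separates all three features; ranking by p-value
# cannot distinguish the two most relevant ones because both underflow.
rank_by_stat = sorted(range(len(stats)), key=lambda i: -stats[i])
tied_at_zero = [i for i, p in enumerate(pvals) if p == 0.0]
```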

To examine the generalizability of p-value versus test statistic feature ranking, a
selection of academic datasets was examined as presented in Table C-1. Table C-1
presents an amount and variety of data consistent with that examined in [579]. The
datasets consist of well-known multivariate problems and range in size from 30 exemplars,
3 features, and 3 classes in Insect to 60,000 exemplars, 717 features, and 10 classes in
MNIST.

All datasets were considered using the KS-test and F-test feature relevance
ranking methods, consistent with Section 4.2.1.3. To compute p-values, with the
exception of MNIST, no separation into training and testing sets was pursued and all
datasets were considered in their entirety.

Table C-1: Example Academic Datasets.

Dataset                   Number of Samples in Classes       Number of   Total Number   Reference
                                                             Features    of Exemplars
Fisher Iris               Setosa: 50                         4           150            [235, 580]
                          Versicolor: 50
                          Virginica: 50
Insect                    Species 1: 10                      3           30             [466, 467]
                          Species 2: 10
                          Species 3: 10
Vertebral Column          Spondylolisthesis: 150             6           310            [581]
                          Normal: 100
                          Disk Hernia: 60
Wine Quality              White: 4898                        11          6497           [582]
                          Red: 1599
Wisconsin Breast Cancer   Benign: 458                        9           699            [583]
                          Malignant: 241
Wine                      Cultivar 1: 59                     13          178            [584]
                          Cultivar 2: 71
                          Cultivar 3: 48
MNIST (training set)      1: 6742    6: 5918                 784         60,000         [585, 586]
                          2: 5958    7: 6265
                          3: 6131    8: 5851
                          4: 5842    9: 5949
                          5: 5421    0: 5923
Ecoli                     Cytoplasm: 143                     7           336            [587]
                          Inner Membrane: 116
                          Periplasm: 52
                          Outer Membrane: 25


Fisher Iris was first examined using the p-value and test statistic approaches
described in Section 4.2.1.3. The Fisher Iris dataset is a commonly used academic
discrimination problem that contains measurements of petals and sepals for three species
of Iris flowers: setosa, versicolor, and virginica. This dataset contains 50 observations
per class, no missing values, and four data features: petal length, petal width, sepal
length, and sepal width [235]. Table C-2 presents a similar comparison of features as in
Table IV-2; however, since Fisher Iris consists of only 4 features, the features are not
sorted and the test statistic values represent the actual values for those features. Again, as
in Section 4.2.1.3, many p-values were computed as values beyond machine zero while
their associated test statistic values are reasonable.


Table C-2: p-values vs Test Statistic for Fisher Iris.

| Feature Number | F-Test: Test Statistic | F-Test: p-value | KS-Test: Summed Test Statistic | KS-Test: Summed p-value |
|---|---|---|---|---|
| 1 | 119.26 | 1.67×10^-31 | 9.400 | 1.74×10^-21 |
| 2 | 49.16 | 4.49×10^-17 | 2.4733 | 1.68×10^-22 |
| 3 | 1,180.20 | 2.86×10^-91 | 1.800 | 1.91×10^-21 |
| 4 | 960.00 | 4.17×10^-85 | 2.5733 | 2.84×10^-30 |
| Variance | 332,880.0 | 5.04×10^-34 | 12.7836 | 1.02×10^-42 |

The Insect data considers three species, 10 observations each with no missing
values, of Chaetocnema insects [499, 500]. Data features here correspond to: width of the
first joint of the first tarsus (microns), width of the first joint of the second tarsus
(microns), and maximal width of the aedeagus (microns) [499, 500]. While no p-values
below machine precision were computed, Table C-3 shows again the value of
test-statistic ranking over p-value ranking, since the differences between KS-test p-values are
so small as to be imperceptible.


Table C-3: p-values vs Test Statistic for Insect.

| Feature Number | F-Test: Test Statistic | F-Test: p-value | KS-Test: Summed Test Statistic | KS-Test: Summed p-value |
|---|---|---|---|---|
| 1 | 64.88 | 0.00 | 2.77 | 1.11×10^-8 |
| 2 | 1.36 | 0.27 | 1.77 | 1.11×10^-8 |
| 3 | 1.12 | 0.34 | 2.0 | 3.59×10^-14 |
| Variance | 1,350.1 | 0.033 | 0.27 | 4.09×10^-17 |

The vertebral column dataset considers spine measurements and normal and
abnormal disk issues, such as Disk Hernia and Spondylolisthesis [584]. When examining
the vertebral column dataset, Table C-4, many p-values are seen as being computed
beyond machine precision. However, the test statistic values offer more perceptible
differences between features.

Wine Quality considers various chemical properties, e.g. acidity and sulphates, in
the Portuguese "Vinho Verde" wine and their relationship with a quality score [582].
Table C-5 presents results for the KS-test and F-test DRA approaches; while all but two
KS-test summed p-values were equal to exactly zero, with the non-zero values being
below machine precision, the KS-test statistic value offers a seemingly reasonable
approach to rank features. A similar result is also seen in the F-test for this data.

Table C-4: p-values vs Test Statistic for Vertebral Column.

| Feature Number | F-Test: Test Statistic | F-Test: p-value | KS-Test: Summed Test Statistic | KS-Test: Summed p-value |
|---|---|---|---|---|
| 1 | 98.537 | 8.77×10^-34 | 3.129 | 4.88×10^-7 |
| 2 | 21.298 | 2.22×10^-9 | 3.787 | 3.42×10^-16 |
| 3 | 114.988 | 5.34×10^-38 | 2.777 | 4.89×10^-7 |
| 4 | 89.647 | 2.17×10^-31 | 2.923 | 1.41×10^-9 |
| 5 | 16.869 | 1.12×10^-7 | 4.823 | 4.69×10^-122 |
| 6 | 119.127 | 5.10×10^-39 | 3.013 | 3.42×10^-16 |
| Variance | 2,111.9 | 2.07×10^-15 | 0.602 | 6.36×10^-14 |

Table C-5: p-values vs Test Statistic for Wine Quality.

| Feature Number | F-Test: Test Statistic | F-Test: p-value | KS-Test: Summed Test Statistic | KS-Test: Summed p-value |
|---|---|---|---|---|
| 1 | 8.00 | 1.26×10^-8 | 9.31 | 0.0 |
| 2 | 96.67 | 8.44×10^-117 | 8.64 | 9.86×10^-21 |
| 3 | 9.31 | 3.44×10^-10 | 8.58 | 9.86×10^-21 |
| 4 | 9.11 | 5.97×10^-10 | 8.48 | 0.0 |
| 5 | 50.85 | 1.95×10^-61 | 9.84 | 0.0 |
| 6 | 14.94 | 4.77×10^-17 | 9.17 | 0.0 |
| 7 | 7.72 | 2.77×10^-8 | 9.66 | 0.0 |
| 8 | 136.95 | 6.58×10^-164 | 9.96 | 0.0 |
| 9 | 2.02 | 0.06 | 9.48 | 0.0 |
| 10 | 4.33 | 2.31×10^-4 | 9.19 | 0.0 |
| 11 | 320.59 | 0.0 | 9.45 | 0.0 |
| Variance | 9,434.7 | 3.19×10^-4 | 0.25 | 1.59×10^-6 |

The Wisconsin Breast Cancer dataset concerns various parameters about potential
breast masses for a classification of benign or malignant [583]. As seen in the other
examples, Table C-6 presents results for the KS-test and F-test DRA approaches. Again,
for both approaches, test statistic values are seen to provide results which are real
numbers and not beyond machine precision or infinitesimally small.


Table C-6: p-values vs Test Statistic for Wisconsin Breast Cancer.

| Feature Number | F-Test: Test Statistic | F-Test: p-value | KS-Test: Summed Test Statistic | KS-Test: Summed p-value |
|---|---|---|---|---|
| 1 | 733.21 | 6.84×10^-111 | 3.05 | 7.92×10^-21 |
| 2 | 1,408.5 | 1.75×10^-169 | 1.744 | 0.66 |
| 3 | 1,419.3 | 2.95×10^-170 | 1.743 | 0.51 |
| 4 | 657.79 | 1.11×10^-102 | 1.84 | 0.48 |
| 5 | 608.72 | 4.35×10^-97 | 3.78 | 9.40×10^-9 |
| 6 | 1,014.2 | 4.54×10^-138 | 2.02 | 1.18×10^-4 |
| 7 | 933.29 | 9.85×10^-131 | 2.79 | 9.40×10^-9 |
| 8 | 717.63 | 3.12×10^-109 | 1.99 | 0.32 |
| 9 | 152.04 | 9.68×10^-32 | 3.30 | 4.51×10^-12 |
| Variance | 160,430 | 1.04×10^-63 | 0.59 | 0.07 |

The wine dataset is conceptually similar to the wine quality dataset; however, here
we are interested in discriminating between three different grape cultivars [584]. Similar
to the other example datasets, p-values are again computed beyond machine precision
and offer less obvious interpretability than that seen in the test statistic values. However,
one issue does exist in the KS-test statistic values, with features 5 and 13 producing
identically valued test statistics. This is the only occurrence of this problem, and
despite this issue the test statistic values still appear to offer more consistent and
interpretable relevance ranking values.


Table C-7: p-values vs Test Statistic for Wine.

| Feature Number | F-Test: Test Statistic | F-Test: p-value | KS-Test: Summed Test Statistic | KS-Test: Summed p-value |
|---|---|---|---|---|
| 1 | 135.07 | 3.32×10^-36 | 11.93 | 3.18×10^-71 |
| 2 | 36.94 | 4.13×10^-14 | 7.94 | 0.002 |
| 3 | 13.31 | 4.15×10^-6 | 9.19 | 5.978×10^-7 |
| 4 | 35.77 | 9.44×10^-14 | 11.92 | 3.18×10^-71 |
| 5 | 12.43 | 8.96×10^-6 | 12.00 | 7.99×10^-79 |
| 6 | 93.73 | 2.14×10^-28 | 8.12 | 3.11×10^-4 |
| 7 | 233.93 | 3.59×10^-50 | 7.73 | 0.0017 |
| 8 | 27.58 | 3.88×10^-11 | 11.75 | 2.96×10^-62 |
| 9 | 30.27 | 5.13×10^-12 | 8.92 | 1.79×10^-7 |
| 10 | 120.66 | 1.16×10^-33 | 10.19 | 1.48×10^-27 |
| 11 | 101.32 | 5.92×10^-30 | 10.95 | 1.20×10^-35 |
| 12 | 189.97 | 1.39×10^-44 | 8.52 | 4.45×10^-6 |
| 13 | 207.92 | 5.78×10^-47 | 12.00 | 7.99×10^-79 |
| Variance | 6,040.10 | 7.03×10^-12 | 3.02 | 4.77×10^-7 |

Written character recognition is a concern in many fields, e.g. [588-592]. MNIST
is a dataset that considers thousands of handwritten digits [585, 586]. MNIST's data
features are actually pixels in a 28x28 image, with each of the 60,000 observations
containing one image of one handwritten digit [585, 586]. However, the final image is
really 20x20 since there is a band of 0s around the 20x20 image [585, 586]. Table C-8
presents results when the KS-test and F-test DRA approaches are applied. Values are
sorted from lowest to highest based on the respective test-statistic value, consistent with
those presented for RF-DNA. Notably, this dataset shows that the F-test failed to
produce a test statistic value in some cases, while the KS-test did not. However,
underlying this issue is the data itself; many observations in some features were all 0s.
Such a result is therefore understandable, since the KS-test compares two distributions
and the distributions of two vectors of all zeros are identical. Therefore, the KS-test has no
issue with handling such data, while the F-test does.

Table C-8: p-values vs Test Statistic for MNIST.

| Feature Number | F-Test: Test Statistic | F-Test: p-value | KS-Test: Summed Test Statistic | KS-Test: Summed p-value |
|---|---|---|---|---|
| 1 | NaN | NaN | 420.83 | 0.4576 |
| 2 | NaN | NaN | 418.96 | 0.4576 |
| 3 | NaN | NaN | 417.34 | 4.13×10^-4 |
| 4 | NaN | NaN | 409.15 | 6.39×10^-4 |
| 68 | 3.17 | 7.8×10^-4 | 323.49 | 0.06 |
| 69 | 2.54 | 0.0065 | 323.14 | 0.99 |
| 783 | 0.18 | 0.996 | 143.49 | 8.32 |
| 784 | 0.15 | 0.998 | 143.48 | 9.48 |
| Variance | NaN | NaN | 5,038.9 | 14,094.0 |

The Ecoli dataset considers measurements of various Ecoli cells relating to
different biological aspects [587]. The original dataset contains eight classes, related to
the localization site of the Ecoli [587]. This was condensed into four groups (Cytoplasm,
Inner Membrane, Periplasm, and Outer Membrane) due to the presence of very small
minority classes. When the KS-test and F-test DRA methods are applied, Table C-9, one again sees
the recurring issues with p-values but not with test statistic values.


Table C-9: p-values vs Test Statistic for Ecoli.

| Feature Number | F-Test: Test Statistic | F-Test: p-value | KS-Test: Summed Test Statistic | KS-Test: Summed p-value |
|---|---|---|---|---|
| 1 | 52.34 | 8.30×10^-50 | 1.65 | 0.039 |
| 2 | 61.94 | 2.65×10^-56 | 1.69 | 0.11 |
| 3 | 109.46 | 6.84×10^-82 | 3.59 | 1.00×10^-36 |
| 4 | 46.58 | 1.32×10^-45 | 3.68 | 1.07×10^-36 |
| 5 | 28.18 | 2.76×10^-30 | 1.79 | 0.11 |
| 6 | 181.38 | 1.03×10^-108 | 1.68 | 0.43 |
| 7 | 93.65 | 2.36×10^-74 | 1.78 | 0.41 |
| Variance | 2,700 | 1.09×10^-60 | 0.88 | 0.03 |

Of particular interest was the generalizability of the benefits of test-statistic
feature relevance ranking over p-value feature relevance ranking. This was
demonstrated in all cases except MNIST. The benefit again arose because the representative
academic datasets exhibited machine precision issues when using p-values for feature
relevance ranking, but not when using test statistics. Some statistical software truncates
p-values at a certain point, e.g. JMP truncates p-values and lists them as "<0.0001" [593],
to avoid computing infinitesimally small values. While such an approach would avoid
presenting and using values beyond machine precision, it is logically also
insufficient for feature relevance ranking, since truncated p-values can no longer be
distinguished from one another. No such issues existed with the test statistic
values, and only in the Wine dataset were two identical test statistic values computed
for two features using the KS-test; however, this was the only occurrence of this type of
problem seen across all of these datasets and does not negate the various obvious issues
seen in the p-value rankings.

Throughout all of these academic datasets and the ZigBee RF-DNA dataset, no
such issues existed for the test statistic relevance ranking. This both illustrates the
generalizability of the results in Section 4.2.1.3 to a wide range of problems and dataset
sizes and empirically verifies the recommendation of [365] regarding p-values and
feature relevance ranking.
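
The machine precision issue described above can be illustrated with a small sketch. It is not the procedure used to produce Tables C-2 through C-9; it assumes a plain two-sample KS statistic and the asymptotic Kolmogorov series for the p-value, and the data and the names `ks_statistic` and `ks_pvalue_asymptotic` are hypothetical. The point is only that two features with clearly different test statistics can both yield a p-value that underflows to exactly zero.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Asymptotic Kolmogorov p-value (assumed form); every term is exp(-large)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam == 0.0:
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


# Hypothetical data: two classes of 2000 exemplars, observed on two features.
class_1 = list(range(2000))
feature_a_class_2 = list(range(2000, 4000))  # no overlap with class_1
feature_b_class_2 = list(range(1800, 3800))  # 10% overlap with class_1

d_a = ks_statistic(class_1, feature_a_class_2)
d_b = ks_statistic(class_1, feature_b_class_2)
p_a = ks_pvalue_asymptotic(d_a, 2000, 2000)
p_b = ks_pvalue_asymptotic(d_b, 2000, 2000)

print("test statistics:", d_a, d_b)  # distinct, so the features can be ranked
print("p-values:", p_a, p_b)         # both underflow, so the ranking ties
```

Under these assumptions the two statistics remain distinguishable while both p-values collapse to zero in double precision, which is the behaviour seen for several of the academic datasets.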

As seen in the MNIST data, the KS-test has the benefit that variables consisting of all
0s or identical values can still be examined, while the F-test cannot examine them. However, such
situations indicate that variables with such conditions will make the data singular or
nearly singular, which will preclude further analysis in MDA or other linear classifiers.
Nonlinear and ANN based classifiers may still be able to operate on such data; however,
variables that are identically one value would be necessarily redundant.
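
A minimal sketch of why a constant feature defeats the F-test but not the KS-test is given below. It assumes a textbook one-way ANOVA F statistic and a plain two-sample KS statistic; the function names and the toy data are hypothetical and are not taken from the MNIST experiment.

```python
import math


def anova_f_statistic(groups):
    """One-way ANOVA F statistic; returns NaN when the ratio is 0/0."""
    values = [v for g in groups for v in g]
    k, n = len(groups), len(values)
    grand_mean = sum(values) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    if ms_within == 0.0:
        # Python raises on float division by zero, so mirror IEEE 754 explicitly.
        return math.nan if ms_between == 0.0 else math.inf
    return ms_between / ms_within


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


# Hypothetical constant feature (e.g. a border pixel that is always 0).
constant_groups = [[0.0] * 5, [0.0] * 5, [0.0] * 5]
# Hypothetical informative feature with well separated class means.
informative_groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

f_constant = anova_f_statistic(constant_groups)
f_informative = anova_f_statistic(informative_groups)
ks_constant = ks_statistic(constant_groups[0], constant_groups[1])

print(f_constant, f_informative, ks_constant)
```

For the constant feature both mean squares are zero, so the F ratio is undefined, whereas the two empirical distributions are identical and the KS statistic is simply zero.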
APPENDIX D: DRA Method Fusion Classification and Verification Performance Assessments


The  chess  board  is  the  world,  the  pieces  the  phenomena  of  the  universe,  the  rules 
of  the  game  are  what  we  call  the  laws  of  nature.  The  player  on  the  other  side  is 
hidden  from  us.  All  we  know  is  that  his  play  is  always  fair,  just  and  patient.  But, 
also,  that  he  never  overlooks  a  mistake  or  makes  the  smallest  allowance  for 

ignorance. 

-Thomas  Henry  Huxley,  1825  -  1895 

By considering the DRA fusion methods in Section 4.2.4 one can determine if
fusion of DRA methods offers any performance benefit. MDA/ML models were
constructed using the DRA fusion methods, and the classification and verification
accuracy of each model are presented in Table D-1 and Table D-2, respectively. Table
D-1 shows that DRA fusion methods achieve consistently worse performance than the
best result seen in the DRA methods by themselves (presented in the last column of Table
D-1). However, while score and rank fusion offer consistently poor performance,
concatenation DRA fusion offers performance similar to the original DRA
methods. Thus concatenation DRA fusion might be viable since it balances the
contributions and weaknesses of various methods.

Table D-1: Relative DRA "Gain" (dB) Over Baseline Performance for %C = 90%
Classification Accuracy for DRA Fusion Methods. Bold entries denote values within
10% of the Best, and bold entries with light grey shading denote best case
performance.

| N_F | Set | Score | Rank | Concatenate | Best Result from Table IV-6 |
|---|---|---|---|---|---|
| 26 | Training | -18.462 | - | -13.215 | -13.347 |
| 26 | Testing | -18.393 | - | -13.852 | -13.817 |
| 50 | Training | -8.712 | -16.972 | -9.324 | -7.697 |
| 50 | Testing | -8.513 | -17.343 | -9.482 | -7.967 |
| 100 | Training | -4.732 | -12.532 | -4.105 | -3.387 |
| 100 | Testing | -4.643 | -12.563 | -4.002 | -3.407 |
| 157 | Training | -2.792 | -10.822 | -2.475 | -2.207 |
| 157 | Testing | -2.683 | -10.773 | -2.272 | -2.357 |
| 191 | Training | -2.362 | -10.152 | -2.095 | -1.767 |
| 191 | Testing | -2.303 | -10.223 | -1.972 | -1.917 |

Table D-2: Device ID Verification Performance for %C = 90% at SNR = 10 dB:
True Verification Rate (TVR) for N_Auth = 4 Authorized Devices and Rogue Rejection
Rate (RRR) for N_Auth x N_Rog = 36 rogue scenarios. Bold entries denote values
within 10% of the Best, and bold entries with light grey shading denote best case
performance.

| N_F | Set | Score | Rank | Concatenate | Best Result from Table IV-8 |
|---|---|---|---|---|---|
| 10 | Authorized | 0 | 0 | 25 | 50 |
| 10 | Rogue | 19.44 | 0 | 38.89 | 52.78 |
| 26 | Authorized | 50 | 0 | 25 | 50 |
| 26 | Rogue | 66.67 | 0 | 75 | 80.56 |
| 50 | Authorized | 50 | 0 | 50 | 75 |
| 50 | Rogue | 88.89 | 0 | 86.11 | 91.67 |
| 100 | Authorized | 75 | 0 | 75 | 100 |
| 100 | Rogue | 97.22 | 11.11 | 94.44 | 94.44 |
| 157 | Authorized | 100 | 25 | 75 | 100 |
| 157 | Rogue | 97.22 | 41.67 | 94.44 | 94.44 |
| 191 | Authorized | 100 | 50 | 100 | 100 |
| 191 | Rogue | 97.22 | 55.56 | 97.22 | 94.44 |

The verification results from DRA fusion, Table D-2, show a similar deficiency in
DRA fusion methods as seen in Table D-1. Again, DRA fusion methods consistently
underperform individual DRA methods for verification, particularly at low N_DRA. At
higher N_DRA, e.g. N_DRA = [100, 157, 191], DRA fusion methods are seen to achieve
comparable or better performance than the individual DRA methods. However, it
should be taken into consideration that the performance differences seen are very slight.
Thus DRA fusion methods have limited applicability to RF-DNA classifier model
development when compared to using the original DRA methods.

APPENDIX E: Gradient Descent and Derivatives in GLVQ Family Algorithms


...artificial  networks  need  not  imitate  biology. 

-Teuvo  Kohonen,  1934  - 

In GLVQ the cost function is no longer the distance measure itself and is now
expressed as a function of both a sigmoid, (3.33), and a relative distance measure, (3.34),
which is itself a function of both the nearest in-class and out-of-class distances. Overall,
these changes complicate the derivation process, and the process must be examined
closely.

The cost function itself is first examined. Correctly, to compute the first
derivative, one must consider that the derivative is with respect to the appropriate PV, $w^J$
or $w^L$. However, since the in/out-of-class aspect of the PV is not functionally relevant,
this can be generalized as $\partial f(\mu(x^m))/\partial w$. First, considering $\partial f(\mu(x^m))/\partial \mu(x^m)$, one
must realize that $\mu(x^m)$ is a function within $f(\mu(x^m))$; therefore this can be solved via
the chain rule as described in (3.22). With this approach, the gradient of the cost function
can be computed as

\[ \frac{\partial f(\mu(x^m))}{\partial w} = \frac{\partial f(\mu(x^m))}{\partial \mu(x^m)} \, \frac{\partial \mu(x^m)}{\partial w}, \tag{E.1} \]

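with

\[ \frac{\partial f(\mu(x^m))}{\partial \mu(x^m)} = f'(\mu(x^m)) \, \mu'(x^m) \tag{E.2} \]

where $f(\mu(x^m)) = (1 + e^{-\mu(x^m)})^{-1}$ due to the formulation in (3.32)-(3.34), thus
yielding the following

\[ \frac{\partial f(\mu(x^m))}{\partial \mu(x^m)} = \left( \frac{1}{1 + e^{-\mu(x^m)}} \right)^{2} e^{-\mu(x^m)}, \tag{E.3} \]

which, because of the expression in (E.2), reduces to

\[ \frac{\partial f(\mu(x^m))}{\partial \mu(x^m)} = \left( \frac{1}{1 + e^{-\mu(x^m)}} \right) \left( \frac{e^{-\mu(x^m)}}{1 + e^{-\mu(x^m)}} \right) \tag{E.4} \]

\[ = f(\mu(x^m)) \left( 1 - f(\mu(x^m)) \right). \tag{E.5} \]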
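
The sigmoid derivative identity above can be checked numerically. The sketch below assumes the standard logistic sigmoid and compares the closed form f(1 - f) against a central finite difference; the function names and the sample values of the relative distance are illustrative only.

```python
import math


def sigmoid(mu):
    """Logistic sigmoid, assumed here to be the cost function f."""
    return 1.0 / (1.0 + math.exp(-mu))


def sigmoid_derivative(mu):
    """Closed form f(mu) * (1 - f(mu))."""
    f = sigmoid(mu)
    return f * (1.0 - f)


def central_difference(func, x, h=1e-6):
    """Numerical derivative used only as an independent check."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


# A relative distance (dJ - dL) / (dJ + dL) with non-negative distances lies in [-1, 1].
checks = [
    (mu, sigmoid_derivative(mu), central_difference(sigmoid, mu))
    for mu in (-1.0, -0.25, 0.0, 0.5, 1.0)
]

for mu, analytic, numeric in checks:
    print(f"mu={mu:+.2f}  analytic={analytic:.8f}  numeric={numeric:.8f}")
```

The two columns agree to within the finite-difference error, which supports using f(1 - f) directly in the update equations rather than differentiating the sigmoid numerically.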

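With a solution to $\partial f(\mu(x^m))/\partial \mu(x^m)$, one must now solve for $\partial \mu(x^m)/\partial w$.
Since $\mu(x^m)$ is expressed in the form seen in (3.34), $\partial \mu(x^m)/\partial w$ can be solved via a
quotient rule,

\[ \partial\left(\frac{u}{v}\right) = \frac{v \, \partial u - u \, \partial v}{v^2}, \tag{E.6} \]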

where the derivative of both the numerator and denominator must be computed [276].
Per (46), $v = (d^J + d^L)$, $u = (d^J - d^L)$, and $v^2 = (d^J + d^L)^2$, leaving $\partial v$ and $\partial u$ to be
computed. One must realize that $\partial v$ and $\partial u$ are both a function of the in-class or
out-of-class PV gradient descents, $w^J$ and $w^L$ respectively; therefore computing $\partial v$ and $\partial u$
involves solving four derivatives to yield two equations for the in-class and out-of-class
gradient descents, $\partial u^J$ and $\partial v^J$ for $w^J$, and $\partial u^L$ and $\partial v^L$ for $w^L$, respectively. All four
derivatives can be generally expressed as

\[ \frac{\partial u}{\partial w^{J,L}} = \frac{\partial (d^J - d^L)}{\partial w^{J,L}} \tag{E.7} \]

and the derivatives computed via the sum of derivatives rule,

\[ \partial (u + v) = \partial u + \partial v. \tag{E.8} \]

For derivatives associated with $u$, (E.7) can be expressed as

\[ \frac{\partial (d^J - d^L)}{\partial w^{J,L}} = \frac{\partial d^J}{\partial w^J} - \frac{\partial d^L}{\partial w^L} \tag{E.9} \]

and similarly for $v$ as

\[ \frac{\partial v}{\partial w^{J,L}} = \frac{\partial (d^J + d^L)}{\partial w^{J,L}} = \frac{\partial d^J}{\partial w^J} + \frac{\partial d^L}{\partial w^L}. \tag{E.10} \]

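Obviously, depending on whether these derivatives are computed for $\partial w^J$ or $\partial w^L$, one of
these components will equal zero and the other will be computed via the derivative of the
distance metric. Therefore, the GLVQ gradient derivative formulation can be simplified
to the following two general equations, $\partial(u/v)^J$ and $\partial(u/v)^L$, which is simplified since
$\partial v^J = \partial u^J$ and $\partial v^L = -\partial u^L$,

\[ \partial\left(\frac{u}{v}\right)^J = \frac{u \, \partial u^J - v \, \partial v^J}{v^2} \tag{E.11} \]

and

\[ \partial\left(\frac{u}{v}\right)^L = \frac{u \, \partial u^L - v \, \partial v^L}{v^2}; \tag{E.12} \]

this can further be simplified to:

\[ \partial\left(\frac{u}{v}\right)^J = \frac{\partial u^J (u - v)}{v^2} \tag{E.13} \]

\[ \partial\left(\frac{u}{v}\right)^L = \frac{\partial u^L (u + v)}{v^2}. \tag{E.14} \]

Inserting our expressions for $u$ and $v$ into (E.13) and (E.14) yields,

\[ \partial\left(\frac{u}{v}\right)^J = \frac{\partial u^J \left( (d^J - d^L) - (d^J + d^L) \right)}{(d^J + d^L)^2} = \frac{\partial u^J (-2 d^L)}{(d^J + d^L)^2} \tag{E.15} \]

\[ \partial\left(\frac{u}{v}\right)^L = \frac{\partial u^L \left( (d^J - d^L) + (d^J + d^L) \right)}{(d^J + d^L)^2} = \frac{\partial u^L (2 d^J)}{(d^J + d^L)^2}, \tag{E.16} \]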

which provides the negation to make the in-class PV operation move closer and the
out-of-class PV move further away. From this formulation, and assuming one does not change
the cost function itself, to change distance metrics one must merely compute the first
derivative of the respective distance metric with respect to both the in-class and
out-of-class PV and insert it appropriately. If one has examined changing distance metrics in the
LVQ process first, then one only needs to consider the computed first derivative and
appropriately add superscripts to designate in-class and out-of-class distance.

For the nominal squared Euclidean distance metric, this is solved via the chain
rule and hence all derivatives are multiplied by -1 due to the negative $w$ term. One can
then solve (E.15) for $\partial u^J$

\[ \partial u^J = \frac{\partial u}{\partial w^J} = \frac{\partial d^J}{\partial w^J} = 2(x^m - w^J) \cdot (-1) = -2(x^m - w^J) \tag{E.17} \]

and for $\partial v^J$

\[ \partial v^J = \frac{\partial v}{\partial w^J} = \frac{\partial d^J}{\partial w^J} = 2(x^m - w^J) \cdot (-1) = -2(x^m - w^J). \tag{E.18} \]

Then (E.16) can be solved for $\partial u^L$

\[ \partial u^L = \frac{\partial u}{\partial w^L} = -\frac{\partial d^L}{\partial w^L} = -2(x^m - w^L) \cdot (-1) = 2(x^m - w^L) \tag{E.19} \]

and $\partial v^L$

\[ \partial v^L = \frac{\partial v}{\partial w^L} = \frac{\partial d^L}{\partial w^L} = 2(x^m - w^L) \cdot (-1) = -2(x^m - w^L). \tag{E.20} \]

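To compute the equation for the gradient descent updates, one must place the
appropriate components into (E.6) for the in-class or out-of-class gradient descents, $w^J$ and
$w^L$ respectively (with distances denoted as $d^J$ and $d^L$), which yields

\[ \partial\left(\frac{u}{v}\right)^J = \frac{(d^J + d^L)\left(-2(x^m - w^J)\right) - (d^J - d^L)\left(-2(x^m - w^J)\right)}{(d^J + d^L)^2} \tag{E.21} \]

and

\[ \partial\left(\frac{u}{v}\right)^L = \frac{(d^J + d^L)\left(2(x^m - w^L)\right) - (d^J - d^L)\left(-1 \cdot 2(x^m - w^L)\right)}{(d^J + d^L)^2} \tag{E.22} \]

which can be expressed as

\[ \partial\left(\frac{u}{v}\right)^J = \frac{-2(x^m - w^J)\left((d^J + d^L) - (d^J - d^L)\right)}{(d^J + d^L)^2} \tag{E.23} \]

and

\[ \partial\left(\frac{u}{v}\right)^L = \frac{2(x^m - w^L)\left((d^J + d^L) + (d^J - d^L)\right)}{(d^J + d^L)^2} \tag{E.24} \]

which further reduces to

\[ \partial\left(\frac{u}{v}\right)^J = \frac{-2(x^m - w^J)(2 d^L)}{(d^J + d^L)^2} \tag{E.25} \]

and

\[ \partial\left(\frac{u}{v}\right)^L = \frac{2(x^m - w^L)(2 d^J)}{(d^J + d^L)^2} \tag{E.26} \]

which yields,

\[ \partial\left(\frac{u}{v}\right)^J = \frac{-4(x^m - w^J) d^L}{(d^J + d^L)^2} \tag{E.27} \]

and

\[ \partial\left(\frac{u}{v}\right)^L = \frac{4(x^m - w^L) d^J}{(d^J + d^L)^2}, \tag{E.28} \]

which is the derivative of the distance used in the quotient rule, within the chain rule.
The gradient descent for GRLVQ type algorithms is then the gradient by chain rule

\[ \frac{\partial f(\mu(x^m))}{\partial \mu(x^m)} \, \frac{4 d^{J,L}}{(d^J + d^L)^2} \tag{E.29} \]

multiplied by the learning rate, $\epsilon(t)$, and a differential shifting,

\[ (x^m - w^{J,L}), \tag{E.30} \]

which yields the gradient descent equations in (3.38).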
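
The quotient-rule result can be sanity-checked numerically. The sketch below is not the dissertation's implementation: it assumes the squared Euclidean distance and the relative distance (dJ - dL) / (dJ + dL), evaluates the partial derivatives of that relative distance with respect to the in-class and out-of-class PVs in closed form, and compares them against finite differences. All names and the sample vectors are hypothetical.

```python
def sq_euclidean(x, w):
    """Squared Euclidean distance between an exemplar and a prototype vector."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def mu(x, w_in, w_out):
    """Relative distance (dJ - dL) / (dJ + dL)."""
    d_in, d_out = sq_euclidean(x, w_in), sq_euclidean(x, w_out)
    return (d_in - d_out) / (d_in + d_out)


def dmu_dw_in(x, w_in, w_out):
    """Closed form: -4 dL (x - wJ) / (dJ + dL)^2."""
    d_in, d_out = sq_euclidean(x, w_in), sq_euclidean(x, w_out)
    scale = -4.0 * d_out / (d_in + d_out) ** 2
    return [scale * (xi - wi) for xi, wi in zip(x, w_in)]


def dmu_dw_out(x, w_in, w_out):
    """Closed form: +4 dJ (x - wL) / (dJ + dL)^2."""
    d_in, d_out = sq_euclidean(x, w_in), sq_euclidean(x, w_out)
    scale = 4.0 * d_in / (d_in + d_out) ** 2
    return [scale * (xi - wi) for xi, wi in zip(x, w_out)]


def numeric_gradient(func, w, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += h
        down[i] -= h
        grad.append((func(up) - func(down)) / (2.0 * h))
    return grad


# Hypothetical exemplar and prototype vectors.
x = [0.5, -1.0, 2.0]
w_in = [0.0, -0.5, 1.5]
w_out = [2.0, 1.0, -1.0]

analytic_in = dmu_dw_in(x, w_in, w_out)
analytic_out = dmu_dw_out(x, w_in, w_out)
numeric_in = numeric_gradient(lambda w: mu(x, w, w_out), w_in)
numeric_out = numeric_gradient(lambda w: mu(x, w_in, w), w_out)

# One descent step on mu (the positive sigmoid factor is omitted for brevity).
step = 0.1
w_in_new = [w - step * g for w, g in zip(w_in, analytic_in)]
w_out_new = [w - step * g for w, g in zip(w_out, analytic_out)]
mu_before, mu_after = mu(x, w_in, w_out), mu(x, w_in_new, w_out_new)
print(mu_before, mu_after)
```

Descending along these gradients moves the in-class PV toward the exemplar and the out-of-class PV away from it, so the relative distance decreases after the step.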
APPENDIX F: Gradient Descent in GRLVQ and GRLVQI Relevance Computation


Those  who  are  good  at  archery  learnt  from  the  bow  and  not  from  Yi  the  Archer.  Those 
who  know  how  to  manage  boats  learnt  from  the  boats  and  not  from  Wo.  Those  who  can 
think  learnt  from  themselves,  and  not  from  the  Sages. 

-Anonymous  (T’ang  Dynasty)1 


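1 From the 8th Century Taoist book Kuan Yin Tzu

For GRLVQ and GRLVQI, the relevance computations and relevance gradient
descent must be considered. GRLVQ and GRLVQI extend GLVQ in a similar manner as
RLVQ extends LVQ. Thus the PV updates in GRLVQ and GRLVQI are consistent with
the gradient update in Section 5.2.4, and the relevance computation in GRLVQ and
GRLVQI is associated with a gradient descent. As in Section 5.2.2.2(a), this is a function
of $\psi_q$ and it would be computed as $\partial f(\mu(x^m))/\partial \psi$, or

\[ \frac{\partial f(\mu(x^m))}{\partial \psi} = \frac{\partial f(\mu(x^m))}{\partial \mu(x^m)} \, \frac{\partial \mu(x^m)}{\partial \psi} \tag{F.1} \]

with $\partial f(\mu(x^m))/\partial \mu(x^m)$ already solved for the PV update, in (E.2)-(E.5). Therefore, solving (F.1)
involves solving $\partial \mu(x^m)/\partial \psi$, which involves a logically similar approach to solving
for $\partial \mu(x^m)/\partial w$, via the quotient rule in (E.6), only with $v = (d^J + d^L)$, $u = (d^J - d^L)$,
and $v^2 = (d^J + d^L)^2$, for $u$

\[ \frac{\partial u}{\partial \psi} = \frac{\partial (d^J - d^L)}{\partial \psi} = \frac{\partial d^J}{\partial \psi} - \frac{\partial d^L}{\partial \psi} \tag{F.2} \]

and for $v$

\[ \frac{\partial v}{\partial \psi} = \frac{\partial (d^J + d^L)}{\partial \psi} = \frac{\partial d^J}{\partial \psi} + \frac{\partial d^L}{\partial \psi}. \tag{F.3} \]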

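For the nominal squared Euclidean distance equation, components of (F.2) and
(F.3) can be solved as

\[ \frac{\partial d^J}{\partial \psi} = \psi \cdot 0 + 1 \cdot \left( x_{iq}(t) - w_{nq}(t) \right)^2 = \left( x_{iq}(t) - w_{nq}(t) \right)^2 \tag{F.4} \]

\[ \frac{\partial d^L}{\partial \psi} = \psi \cdot 0 + 1 \cdot \left( x_{iq}(t) - w_{nq}(t) \right)^2 = \left( x_{iq}(t) - w_{nq}(t) \right)^2. \tag{F.5} \]

Since

\[ \frac{\partial d^J}{\partial \psi} = \frac{\partial d^L}{\partial \psi}, \tag{F.6} \]

\[ \frac{\partial u}{\partial \psi} = 0; \tag{F.7} \]

then, for $\partial v$, we can arrive at the solution:

\[ \frac{\partial v}{\partial \psi} = 2 \left( x_{iq}(t) - w_{nq}(t) \right)^2. \tag{F.8} \]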

Putting this together and solving for $\partial(u/v)$ via the quotient rule yields the following,

\[ \partial\left(\frac{u}{v}\right) = \frac{2 (d^J - d^L) \left( x_{iq}(t) - w_{nq}(t) \right)^2}{(d^J + d^L)^2}, \tag{F.9} \]

which yields a relevance update,

\[ \psi_q(t + 1) = \psi_q(t) - \epsilon(t) \, f'(\mu(x^m)) \left( \frac{2 (d^J - d^L) \left( x_{iq}(t) - w_{nq}(t) \right)^2}{(d^J + d^L)^2} \right), \tag{F.10} \]

which is equivalent to the GRLVQ update in (3.37) prior to being multiplied and written
out.


Because  the  improvements  in  GRLVQI  consist  of  scalar  learning  rates  and  criteria 
outside  the  distance  metric  and  cost  function,  the  PV  update  process  is  not  different  from 
that  of  GRLVQ.  Therefore  the  PV  update  process  presented  for  GRLVQ  and  GLVQ  can 
be  directly  applied. 
APPENDIX G: Cost Function Extensions for the GLVQ Family of Algorithms


A  'simple  analysis'  can  be  harder  than  it  looks... 

-Christopher  Chatfield 


From Sections 5.2.2.2(b) and 5.2.2.4 it is known that not all derivatives need to be
recomputed. Since changing $\mu(x^m)$ does not change the cost function expression in
(3.34), only the derivative for the second part of (E.1), $\partial \mu(x^m)/\partial w$, must be
recomputed. Again, following the quotient rule in (E.6), we determine the respective
quantities for (5.9) as $v = (d^J)^2 + (d^L)^2$, $u = (d^J)^2 - (d^L)^2$, and $v^2 = ((d^J)^2 +
(d^L)^2)^2$, with again $\partial v$ and $\partial u$ to be computed for the respective in/out of class PVs.
Then the process in Section 5.2.2.2(b) is repeated to arrive at new PV update rules.
Again, four derivatives are solved to yield two equations for the in-class and out-of-class gradient
descents, $\partial u^J$ and $\partial v^J$ for $w^J$, and $\partial u^L$ and $\partial v^L$ for $w^L$, respectively. Similar to the general
derivative in (E.7), all four derivatives can be generally expressed as

\[ \frac{\partial u}{\partial w^{J,L}} = \frac{\partial \left( (d^J)^2 - (d^L)^2 \right)}{\partial w^{J,L}} \tag{G.1} \]

with the derivative for $u$ expressed as

\[ \frac{\partial \left( (d^J)^2 - (d^L)^2 \right)}{\partial w^{J,L}} = \frac{\partial (d^J)^2}{\partial w^J} - \frac{\partial (d^L)^2}{\partial w^L} \tag{G.2} \]

and the derivative for $v$ expressed as

\[ \frac{\partial v}{\partial w^{J,L}} = \frac{\partial \left( (d^J)^2 + (d^L)^2 \right)}{\partial w^{J,L}} = \frac{\partial (d^J)^2}{\partial w^J} + \frac{\partial (d^L)^2}{\partial w^L}. \tag{G.3} \]

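Consistent with 5.2.4, if $\partial w^J$ or $\partial w^L$ is of interest, one of these components will equal
zero and the other will be computed via the derivative of the distance metric. Since the
GLVQ gradient descent formulation has not been altered, we can use the quotient rule
derivatives in (E.13) and (E.14); inserting our expressions for $u$ and $v$ into (E.13) and
(E.14) yields,

\[ \partial\left(\frac{u}{v}\right)^J = \frac{\partial u^J \left( ((d^J)^2 - (d^L)^2) - ((d^J)^2 + (d^L)^2) \right)}{\left( (d^J)^2 + (d^L)^2 \right)^2} = \frac{\partial u^J \left( -2 (d^L)^2 \right)}{\left( (d^J)^2 + (d^L)^2 \right)^2} \tag{G.4} \]

and

\[ \partial\left(\frac{u}{v}\right)^L = \frac{\partial u^L \left( ((d^J)^2 - (d^L)^2) + ((d^J)^2 + (d^L)^2) \right)}{\left( (d^J)^2 + (d^L)^2 \right)^2} = \frac{\partial u^L \left( 2 (d^J)^2 \right)}{\left( (d^J)^2 + (d^L)^2 \right)^2}. \tag{G.5} \]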


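Next, one can solve (E.15) for $\partial u^J$ and $\partial v^J$, followed by the differential shifting for
$\partial u^L$ and $\partial v^L$. Firstly, we compute

\[ \partial u^J = \frac{\partial u}{\partial w^J} = \frac{\partial \left( (d^J)^2 \right)}{\partial w^J} = \frac{\partial (x^m - w^J)^4}{\partial w^J} = 4 (x^m - w^J)^3 \cdot (-1) = -4 (x^m - w^J)^3 \tag{G.6} \]

and for $\partial v^J$

\[ \partial v^J = \frac{\partial v}{\partial w^J} = \frac{\partial (d^J)^2}{\partial w^J} = 4 (x^m - w^J)^3 \cdot (-1) = -4 (x^m - w^J)^3. \tag{G.7} \]

Then (E.16) can be solved for $\partial u^L$

\[ \partial u^L = \frac{\partial u}{\partial w^L} = -\frac{\partial (d^L)^2}{\partial w^L} = -4 (x^m - w^L)^3 \cdot (-1) = 4 (x^m - w^L)^3 \tag{G.8} \]

and $\partial v^L$

\[ \partial v^L = \frac{\partial v}{\partial w^L} = \frac{\partial (d^L)^2}{\partial w^L} = 4 (x^m - w^L)^3 \cdot (-1) = -4 (x^m - w^L)^3. \tag{G.9} \]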

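Assembling all of these components, one can fully extend to a PV update equation

\[ \begin{aligned} w^J(t + 1) &= w^J(t) + \frac{8 \, \epsilon(t) \, \dfrac{\partial f}{\partial \mu(x^m)} \, (d^L)^2}{\left( (d^J)^2 + (d^L)^2 \right)^2} \, (x^m - w^J)^3 \\ w^L(t + 1) &= w^L(t) - \frac{8 \, \epsilon(t) \, \dfrac{\partial f}{\partial \mu(x^m)} \, (d^J)^2}{\left( (d^J)^2 + (d^L)^2 \right)^2} \, (x^m - w^L)^3 \end{aligned} \tag{G.10} \]

which differs from the PV updates in (3.35) only by the scalar multiplier and the squared
terms in the relative distance difference equations.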
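
As a cross-check on the scalar multiplier of 8 and the cubed shifting term, the sketch below evaluates the derivative of a squared relative distance in the scalar case, assuming d = (x - w)^2 and a relative distance of the form (dJ^2 - dL^2) / (dJ^2 + dL^2). This is an illustrative reading of the derivation rather than the author's code, and every name and value in it is hypothetical.

```python
def distance(x, w):
    """Scalar squared Euclidean distance."""
    return (x - w) ** 2


def mu_squared(x, w_in, w_out):
    """Relative distance built from squared distances."""
    d_in, d_out = distance(x, w_in), distance(x, w_out)
    return (d_in ** 2 - d_out ** 2) / (d_in ** 2 + d_out ** 2)


def dmu_dw_in(x, w_in, w_out):
    """Closed form: -8 (x - wJ)^3 dL^2 / (dJ^2 + dL^2)^2."""
    d_in, d_out = distance(x, w_in), distance(x, w_out)
    return -8.0 * (x - w_in) ** 3 * d_out ** 2 / (d_in ** 2 + d_out ** 2) ** 2


def dmu_dw_out(x, w_in, w_out):
    """Closed form: +8 (x - wL)^3 dJ^2 / (dJ^2 + dL^2)^2."""
    d_in, d_out = distance(x, w_in), distance(x, w_out)
    return 8.0 * (x - w_out) ** 3 * d_in ** 2 / (d_in ** 2 + d_out ** 2) ** 2


def central_difference(func, w, h=1e-6):
    """Numerical derivative used only as an independent check."""
    return (func(w + h) - func(w - h)) / (2.0 * h)


# Hypothetical scalar exemplar and prototypes.
x, w_in, w_out = 0.7, 0.2, 1.9

analytic_in = dmu_dw_in(x, w_in, w_out)
analytic_out = dmu_dw_out(x, w_in, w_out)
numeric_in = central_difference(lambda w: mu_squared(x, w, w_out), w_in)
numeric_out = central_difference(lambda w: mu_squared(x, w_in, w), w_out)

print(analytic_in, numeric_in)
print(analytic_out, numeric_out)
```

Gradient descent negates these derivatives, which gives a positive step for the in-class PV and a negative step for the out-of-class PV, each scaled by 8 and by the cubed difference between exemplar and prototype.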
APPENDIX H: Relevance Derivatives for GRLVQI


Remember  it  takes  time,  patience,  critical  practice  and  knowledge  to  learn  any  art  or 

craft. 


-Lloyd  Reynolds,  1902-1978 


As  previously  noted  in  Sections  3.3. 1.4,  3. 3. 1.6,  5.2.2.2(a),  and  5.2.2.2(b), 
relevance  learning  in  LVQ  algorithms  involves  a  further  gradient  descent  operation. 
Therefore,  when  considering  alternative  distance  measures  for  GRLVQ  and  GRLVQI, 
the  relevance  computations  and  relevance  gradient  descent  must  be  considered.  As  in 
RLVQ,  the  relevance  computation  in  GRLVQ  and  GRLVQI  is  associated  with  a  gradient 
descent;  therefore  to  compute  the  GRLVQ  and  GRLVQI  update  equations,  we  must 
revisit  the  gradient  descent  computations  in  Section  5.2.2.2(b)  using  the  gradient  update 
in  (G.10)  and  relative  distance  difference  (5.9).  Again,  as  in  Section  5.2.2.2(a),  if  this  is  a 
function  of  the  ipq,  then  it  would  be  computed  as  df(p.(xm))/dxp,  or 

a/QfOT)  df(pyp)di4xy) 

dw  dp.(xm )  dxp 

with  q  (x,n^  already  solved  for  the  PV  update,  in  (E.2)  to  (E.5).  Therefore,  solving 


(F.l)  involves  solving—^ — ,  which  involves  a  logically  similar  approach  to  solving  for 

/ dw ,  via  the  quotient  rule  in  (E.6),  only  with  v  =  +  dL),  it  =  ( dJ  —  dL), 

and  v 2  —  ( dJ  +  dL)2,  for 
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and  for  v 


du  _d(dJ  -dL)  _ddJ  ddL 
dip  dip  dip  dip 


dv  d(dJ  +  dL)  dd 1  ddL 
dip  dip  dip  +  dip 


(H.2) 


(H.3) 


For  the  nominal  squared  Euclidean  distance  equation,  components  of  (F.2)  and 
(F.3)  can  be  solved  as 


dd^  2  2 

—  =  ip-0+l-  (xiq  (t)  -  wnq  (0)  =  (xiq  (t)  -  wnq  (0) 


(H.4) 


Since, 


ddL  .  n2  /  n2 

—  =  ip-0  +  l-  {xiq  (t)  -  wnq{t))  =  \xiq  (t)  -  wnq  (t))  . 


(H.5) 


ddJ  _  ddL 
dip  dip 


(H.6) 


then,  for  dv,  we  can  arrive  at  the  solution: 

dv  , 

—  =  2  (xiq(t) 


Wnq(t))  ■ 


(H.7) 


(H.8) 


Putting this together and solving for $\partial \mu / \partial \psi$ via the quotient rule yields the following,

\[ \frac{\partial \mu}{\partial \psi} = \frac{2 \, (d^J - d^L)(x_{iq}(t) - w_{nq}(t))^2}{(d^J + d^L)^2} \tag{H.9} \]


which yields a relevance update,
which is equivalent to the GRLVQ update in (3.38) prior to being multiplied and written
out.
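
As a numerical illustration of the quotient-rule step above, the following minimal Python sketch evaluates the relative distance difference and its gradient with respect to the relevance vector for a relevance-weighted squared Euclidean distance. The function names, the example vectors, and the finite-difference check are assumptions introduced for illustration; they are not taken from the GRLVQI implementation. The sketch keeps the two per-feature derivative terms separate, so it does not rely on (H.6).

```python
def weighted_sq_dist(x, w, psi):
    """Relevance-weighted squared Euclidean distance: sum_q psi_q * (x_q - w_q)^2."""
    return sum(p * (xq - wq) ** 2 for xq, wq, p in zip(x, w, psi))


def mu(x, w_j, w_l, psi):
    """Relative distance difference (d^J - d^L) / (d^J + d^L)."""
    d_j = weighted_sq_dist(x, w_j, psi)
    d_l = weighted_sq_dist(x, w_l, psi)
    return (d_j - d_l) / (d_j + d_l)


def mu_grad_psi(x, w_j, w_l, psi):
    """Gradient of mu with respect to each psi_q via the quotient rule."""
    d_j = weighted_sq_dist(x, w_j, psi)
    d_l = weighted_sq_dist(x, w_l, psi)
    u, v = d_j - d_l, d_j + d_l
    grad = []
    for xq, wjq, wlq in zip(x, w_j, w_l):
        dd_j = (xq - wjq) ** 2          # d(d^J)/d(psi_q)
        dd_l = (xq - wlq) ** 2          # d(d^L)/d(psi_q)
        du, dv = dd_j - dd_l, dd_j + dd_l
        grad.append((v * du - u * dv) / v ** 2)
    return grad


def finite_diff(f, vec, q, h=1e-6):
    """Central finite difference of f along coordinate q (sanity check only)."""
    up, dn = list(vec), list(vec)
    up[q] += h
    dn[q] -= h
    return (f(up) - f(dn)) / (2 * h)


# Hypothetical example: one exemplar, an in-class PV (w_j) and an out-of-class PV (w_l).
x = [1.0, 2.0, 0.5]
w_j = [0.8, 2.5, 0.4]
w_l = [1.5, 1.0, 0.9]
psi = [0.5, 0.3, 0.2]

analytic = mu_grad_psi(x, w_j, w_l, psi)
numeric = [finite_diff(lambda p: mu(x, w_j, w_l, p), psi, q) for q in range(len(psi))]
print(analytic)
print(numeric)
```

The analytic and finite-difference gradients should agree to within numerical precision, which offers a quick check before a derivative is coded into an update rule.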
APPENDIX I: Review of Distance Measures

One  accurate  measurement  is  worth  a  thousand  expert  opinions. 

-Admiral  Grace  Hopper,  1906  -  1992 

Various distance metrics exist, with some literature offering comparisons. Jones and Furnas [594] compared the inner product, cosine measure, pseudo-cosine measure, Dice measure, product-moment correlation and covariance, and overlap measure. Zhang and Korfhage [595] offered further analysis of the cosine measure. Both Cha [283] and McGill et al. [596] produced reviews of distance measures; in general these reviews overlap each other, except that McGill et al. included binary distance metrics. From these sources, the following review of distance metrics was produced; below, P and Q are considered to be two different vectors of equal length, n.

Cha [283] considers the Minkowski family to have four measures, all of which are special cases of the general Minkowski distance,

\[ d_{\mathrm{Mink}} = \left( \sum_{i=1}^{n} |P_i - Q_i|^p \right)^{1/p}, \tag{I.1} \]

which, for p = 2, is the Euclidean L2 distance,

\[ d_{\mathrm{Euc}} = \sqrt{\sum_{i=1}^{n} (P_i - Q_i)^2}, \tag{I.2} \]

City Block, for p = 1,

\[ d_{\mathrm{City}} = \sum_{i=1}^{n} |P_i - Q_i|, \tag{I.3} \]

and Chebyshev, for p = ∞,

\[ d_{\mathrm{Cheb}} = \max_i |P_i - Q_i| . \tag{I.4} \]
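
To make the special cases concrete, the short Python sketch below implements the general Minkowski distance alongside the City Block, Euclidean, and Chebyshev measures. The function names and the example vectors are illustrative assumptions only.

```python
import math


def minkowski(p_vec, q_vec, p):
    """General Minkowski distance of order p."""
    return sum(abs(a - b) ** p for a, b in zip(p_vec, q_vec)) ** (1.0 / p)


def city_block(p_vec, q_vec):
    """City Block (L1) distance: Minkowski with p = 1."""
    return sum(abs(a - b) for a, b in zip(p_vec, q_vec))


def euclidean(p_vec, q_vec):
    """Euclidean (L2) distance: Minkowski with p = 2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p_vec, q_vec)))


def chebyshev(p_vec, q_vec):
    """Chebyshev distance: the limit of Minkowski as p grows without bound."""
    return max(abs(a - b) for a, b in zip(p_vec, q_vec))


# Hypothetical example vectors of equal length n = 3.
P = [1.0, 2.0, 3.0]
Q = [4.0, 0.0, 3.0]

print(city_block(P, Q), minkowski(P, Q, 1))
print(euclidean(P, Q), minkowski(P, Q, 2))
print(chebyshev(P, Q), minkowski(P, Q, 50))
```

For a large order such as p = 50, the Minkowski value is already very close to the Chebyshev value, which illustrates why Chebyshev is treated as the p = ∞ member of the family.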

The L1 family of measures includes many measures, which are variations on the City Block, L1, measure through division or scaling. Due to the various methods involved, the L1 family deserves some consideration. The Sorensen measure [284],

\[ d_{\mathrm{Sor}} = \frac{\sum_{i=1}^{n} |P_i - Q_i|}{\sum_{i=1}^{n} (P_i + Q_i)}, \tag{I.5} \]

is typical of the L1 family [283]. The Gower distance metric is merely a scaling of $d_{\mathrm{City}}$ by a scalar and hence differs from $d_{\mathrm{City}}$ only in magnitude [283]; for this reason it is not examined herein. The Soergel, $d_{sg}$, and Kulczynski, $d_{kd}$, measures are variants of Sorensen with the maximum, $\sum_{i=1}^{n} \max(P_i, Q_i)$, or minimum, $\sum_{i=1}^{n} \min(P_i, Q_i)$, in the denominator, respectively [283]. As noted by Cha [283], the Canberra measure differs from Sorensen by normalizing the absolute difference at the individual level,

\[ d_{\mathrm{Can}} = \sum_{i=1}^{n} \frac{|P_i - Q_i|}{P_i + Q_i} . \tag{I.6} \]

The Lorentzian measure,

\[ d_{\mathrm{Lor}} = \sum_{i=1}^{n} \ln(1 + |P_i - Q_i|), \tag{I.7} \]

applies the natural logarithm to the City Block measure, with the addition of 1 used to avoid computing the logarithm of zero [283].
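
The distinction between normalizing once over the whole vector (Sorensen) and normalizing each feature individually (Canberra) can be seen in a small Python sketch. The function names and example vectors are assumptions for illustration; the sketch also assumes every $P_i + Q_i$ is nonzero.

```python
import math


def sorensen(p_vec, q_vec):
    """Sorensen: a ratio of sums."""
    num = sum(abs(a - b) for a, b in zip(p_vec, q_vec))
    den = sum(a + b for a, b in zip(p_vec, q_vec))
    return num / den


def canberra(p_vec, q_vec):
    """Canberra: a sum of per-feature ratios (assumes each a + b is nonzero)."""
    return sum(abs(a - b) / (a + b) for a, b in zip(p_vec, q_vec))


def lorentzian(p_vec, q_vec):
    """Lorentzian: natural log of 1 plus each absolute difference, summed."""
    return sum(math.log(1.0 + abs(a - b)) for a, b in zip(p_vec, q_vec))


# Hypothetical example vectors.
P = [1.0, 2.0, 3.0]
Q = [4.0, 0.0, 3.0]

print(sorensen(P, Q))    # 5/13
print(canberra(P, Q))    # 3/5 + 2/2 + 0/6
print(lorentzian(P, Q))  # ln(4) + ln(3) + ln(1)
```

Because Canberra weights each feature by its own magnitude, a small-valued feature can dominate the sum, whereas Sorensen is dominated by the features with the largest absolute differences.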

Many of the intersection family of distance measures are L1 based and identical to an L1 distance measure through a division or subtraction. Examples include the Ruzicka measure,

\[ d_{\mathrm{Ruz}} = \frac{\sum_{i=1}^{n} \min(P_i, Q_i)}{\sum_{i=1}^{n} \max(P_i, Q_i)}, \tag{I.8} \]

which appears different, but is essentially $d_{sg}/d_{kd}$. This is similar for the Kulczynski measure, which is $1/d_{kd}$, the Intersection measure, which is $\tfrac{1}{2} d_{\mathrm{City}}$, the Czekanowski measure, which is identical to Sorensen, and Motyka, which is $\tfrac{1}{2} d_{\mathrm{Sor}}$ [283]. However, some other Intersection family measures are different enough to warrant evaluation, including Wave Hedges,

\[ d_{\mathrm{WH}} = \sum_{i=1}^{n} \frac{|P_i - Q_i|}{\max(P_i, Q_i)}, \tag{I.9} \]

and Tanimoto,

\[ d_{\mathrm{Tani}} = \frac{\sum_{i=1}^{n} \left( \max(P_i, Q_i) - \min(P_i, Q_i) \right)}{\sum_{i=1}^{n} \max(P_i, Q_i)} . \tag{I.10} \]
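
The redundancy among these measures can be checked numerically. The Python sketch below, using illustrative function names and strictly positive example vectors (both assumptions), confirms that Ruzicka equals the ratio of Soergel to Kulczynski, and that Tanimoto coincides with Soergel because max − min equals the absolute difference for each feature.

```python
import math


def soergel(p_vec, q_vec):
    """Soergel: sum of absolute differences over the sum of per-feature maxima."""
    return (sum(abs(a - b) for a, b in zip(p_vec, q_vec))
            / sum(max(a, b) for a, b in zip(p_vec, q_vec)))


def kulczynski_d(p_vec, q_vec):
    """Kulczynski distance: sum of absolute differences over the sum of minima."""
    return (sum(abs(a - b) for a, b in zip(p_vec, q_vec))
            / sum(min(a, b) for a, b in zip(p_vec, q_vec)))


def ruzicka(p_vec, q_vec):
    """Ruzicka: sum of minima over sum of maxima."""
    return (sum(min(a, b) for a, b in zip(p_vec, q_vec))
            / sum(max(a, b) for a, b in zip(p_vec, q_vec)))


def wave_hedges(p_vec, q_vec):
    """Wave Hedges: per-feature absolute difference over the per-feature maximum."""
    return sum(abs(a - b) / max(a, b) for a, b in zip(p_vec, q_vec))


def tanimoto(p_vec, q_vec):
    """Tanimoto: sum of (max - min) over the sum of maxima."""
    return (sum(max(a, b) - min(a, b) for a, b in zip(p_vec, q_vec))
            / sum(max(a, b) for a, b in zip(p_vec, q_vec)))


# Hypothetical strictly positive example vectors.
P = [1.0, 2.0, 3.0]
Q = [4.0, 1.0, 3.0]

print(ruzicka(P, Q), soergel(P, Q) / kulczynski_d(P, Q))
print(tanimoto(P, Q), soergel(P, Q))
print(wave_hedges(P, Q))
```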


The Inner Product family is a group of measures that involve computing the inner product, P · Q, of the vectors in question [283]. The inner product measure,

\[ d_{\mathrm{IP}} = P \cdot Q = \sum_{i=1}^{n} P_i Q_i, \tag{I.11} \]

reflects this. Many of the measures in this family include the inner product computation along with other components. The Harmonic mean scales $d_{\mathrm{IP}}$,

\[ d_{\mathrm{HM}} = \sum_{i=1}^{n} \frac{P_i Q_i}{P_i + Q_i} . \tag{I.12} \]

Cha [283] presents the cosine measure as the inner product metric with a further scaling in the denominator,

\[ d_{\cos} = \frac{\sum_{i=1}^{n} P_i Q_i}{\sqrt{\sum_{i=1}^{n} P_i^2} \, \sqrt{\sum_{i=1}^{n} Q_i^2}} . \tag{I.13} \]

A variant on the cosine measure is the pseudo-cosine measure,

\[ d_{\mathrm{Pcos}} = \frac{\sum_{i=1}^{n} P_i Q_i}{\sum_{i=1}^{n} P_i \, \sum_{i=1}^{n} Q_i}, \tag{I.14} \]

which differs from the cosine measure in how it measures vector length [594]. Cha [283] also presents the Kumar-Hassebrook metric, another extension of the cosine measure,

\[ d_{\mathrm{KumarH}} = \frac{\sum_{i=1}^{n} P_i Q_i}{\sum_{i=1}^{n} P_i^2 + \sum_{i=1}^{n} Q_i^2 - \sum_{i=1}^{n} P_i Q_i} . \tag{I.15} \]

The Jaccard,

\[ d_{\mathrm{Jac}} = \frac{\sum_{i=1}^{n} (P_i - Q_i)^2}{\sum_{i=1}^{n} P_i^2 + \sum_{i=1}^{n} Q_i^2 - \sum_{i=1}^{n} P_i Q_i}, \tag{I.16} \]

and Dice [597],

\[ d_{\mathrm{Dice}} = \frac{\sum_{i=1}^{n} (P_i - Q_i)^2}{\sum_{i=1}^{n} P_i^2 + \sum_{i=1}^{n} Q_i^2}, \tag{I.17} \]

measures are also related to the inner product family [283].
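
A brief Python sketch, with assumed function names and example vectors, shows how the cosine and pseudo-cosine measures differ only in the length normalization, and how the Jaccard distance relates to the Kumar-Hassebrook measure: since the sum of squared differences equals ΣP² + ΣQ² − 2ΣPQ, the Jaccard distance is one minus Kumar-Hassebrook.

```python
import math


def inner_product(p_vec, q_vec):
    return sum(a * b for a, b in zip(p_vec, q_vec))


def cosine(p_vec, q_vec):
    """Inner product scaled by the Euclidean (L2) lengths of both vectors."""
    return inner_product(p_vec, q_vec) / (
        math.sqrt(inner_product(p_vec, p_vec)) * math.sqrt(inner_product(q_vec, q_vec)))


def pseudo_cosine(p_vec, q_vec):
    """Inner product scaled by the plain sums (L1 lengths for non-negative data)."""
    return inner_product(p_vec, q_vec) / (sum(p_vec) * sum(q_vec))


def kumar_hassebrook(p_vec, q_vec):
    ip = inner_product(p_vec, q_vec)
    return ip / (inner_product(p_vec, p_vec) + inner_product(q_vec, q_vec) - ip)


def jaccard(p_vec, q_vec):
    sq_diff = sum((a - b) ** 2 for a, b in zip(p_vec, q_vec))
    ip = inner_product(p_vec, q_vec)
    return sq_diff / (inner_product(p_vec, p_vec) + inner_product(q_vec, q_vec) - ip)


def dice(p_vec, q_vec):
    sq_diff = sum((a - b) ** 2 for a, b in zip(p_vec, q_vec))
    return sq_diff / (inner_product(p_vec, p_vec) + inner_product(q_vec, q_vec))


# Hypothetical non-negative example vectors.
P = [1.0, 2.0, 2.0]
Q = [2.0, 1.0, 2.0]

print(cosine(P, Q), pseudo_cosine(P, Q))
print(jaccard(P, Q), 1.0 - kumar_hassebrook(P, Q))
print(dice(P, Q))
```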

The Fidelity family appears similar to the Inner Product family; however, these measures include natural logarithms and square roots in the distance computations. While these could sufficiently alter the distance metrics, they could also present problems when negative values are introduced, causing imaginary numbers to be computed. Therefore these will not be considered, but are presented for completeness. The basic measure in this family, Fidelity, is the Inner Product distance with a square root,

\[ d_{\mathrm{Fid}} = \sum_{i=1}^{n} \sqrt{P_i Q_i} . \tag{I.18} \]

Bhattacharyya is a Fidelity family type of measure and is the negative natural log of $d_{\mathrm{Fid}}$,

\[ d_{\mathrm{Bhat}} = -\ln \sum_{i=1}^{n} \sqrt{P_i Q_i} . \tag{I.19} \]

Hellinger involves a scaling of the inner product,

\[ d_{\mathrm{Hell}} = 2 \sqrt{1 - \sum_{i=1}^{n} \sqrt{P_i Q_i}} . \tag{I.20} \]

Matusita involves a further scaling,

\[ d_{\mathrm{Mat}} = \sqrt{2 - 2 \sum_{i=1}^{n} \sqrt{P_i Q_i}} . \tag{I.21} \]

However, Squared-Chord,

\[ d_{\mathrm{SC}} = \sum_{i=1}^{n} \left( \sqrt{P_i} - \sqrt{Q_i} \right)^2, \tag{I.22} \]

offers a variation on the fidelity measure and appears identical to $d_{\mathrm{Fid}}$ by an offset, $1 - d_{\mathrm{SC}} = 2 \sum_{i=1}^{n} \sqrt{P_i Q_i} - 1 = 2 d_{\mathrm{Fid}} - 1$ [283].
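
The offset relationship between Squared-Chord and Fidelity holds when P and Q each sum to one, as is the case for the probability density functions considered by Cha. The Python sketch below checks this under that assumption; the function names and the example vectors are illustrative only.

```python
import math


def fidelity(p_vec, q_vec):
    return sum(math.sqrt(a * b) for a, b in zip(p_vec, q_vec))


def bhattacharyya(p_vec, q_vec):
    return -math.log(fidelity(p_vec, q_vec))


def hellinger(p_vec, q_vec):
    return 2.0 * math.sqrt(1.0 - fidelity(p_vec, q_vec))


def matusita(p_vec, q_vec):
    return math.sqrt(2.0 - 2.0 * fidelity(p_vec, q_vec))


def squared_chord(p_vec, q_vec):
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p_vec, q_vec))


# Hypothetical non-negative vectors that each sum to one (an assumption of this check).
P = [0.2, 0.3, 0.5]
Q = [0.1, 0.6, 0.3]

print(1.0 - squared_chord(P, Q), 2.0 * fidelity(P, Q) - 1.0)
print(matusita(P, Q), math.sqrt(squared_chord(P, Q)))
print(hellinger(P, Q), bhattacharyya(P, Q))
```

A negative entry in either vector would raise a math domain error in `math.sqrt`, which mirrors the imaginary-number concern noted above.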

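The Squared L2 family offers squared variations on Euclidean distance, including the squared Euclidean distance of (1), in addition to other variations. These variations could cause metrics to produce different results; hence some should be investigated. The Pearson χ² and Neyman χ² metrics are similar and differ in the denominator,

\[ d_{P\chi^2} = \sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{Q_i} \tag{I.23} \]

and

\[ d_{N\chi^2} = \sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{P_i}, \tag{I.24} \]

respectively [283]. The Squared χ² further extends these,

\[ d_{S\chi^2} = \sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{P_i + Q_i}, \tag{I.25} \]

and the probabilistic symmetric χ² measure is $2 d_{S\chi^2}$ [283]. The divergence measure,

\[ d_{\mathrm{Div}} = \sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{(P_i + Q_i)^2}, \tag{I.26} \]

further extends $d_{S\chi^2}$ [283]. Clark,

\[ d_{\mathrm{Clark}} = \sqrt{\sum_{i=1}^{n} \left( \frac{|P_i - Q_i|}{P_i + Q_i} \right)^2}, \tag{I.27} \]

and additive symmetric χ²,

\[ d_{AS\chi^2} = \sum_{i=1}^{n} \frac{(P_i - Q_i)^2 (P_i + Q_i)}{P_i Q_i}, \tag{I.28} \]

further complete the squared L2 family [283].

Shannon's entropy family includes additional metrics not encompassed in the other families, including Kullback-Leibler,

\[ d_{\mathrm{KL}} = \sum_{i=1}^{n} P_i \ln \frac{P_i}{Q_i}, \tag{I.29} \]

Jeffreys,

\[ d_{\mathrm{Jeff}} = \sum_{i=1}^{n} (P_i - Q_i) \ln \frac{P_i}{Q_i}, \tag{I.30} \]

K divergence,

\[ d_{\mathrm{Kd}} = \sum_{i=1}^{n} P_i \ln \frac{2 P_i}{P_i + Q_i}, \tag{I.31} \]

Topsoe,

\[ d_{\mathrm{Top}} = \sum_{i=1}^{n} \left( P_i \ln \frac{2 P_i}{P_i + Q_i} + Q_i \ln \frac{2 Q_i}{P_i + Q_i} \right), \tag{I.32} \]

Jensen-Shannon,

\[ d_{\mathrm{JS}} = \frac{1}{2} \left[ \sum_{i=1}^{n} P_i \ln \frac{2 P_i}{P_i + Q_i} + \sum_{i=1}^{n} Q_i \ln \frac{2 Q_i}{P_i + Q_i} \right], \tag{I.33} \]

and Jensen difference,

\[ d_{\mathrm{JD}} = \sum_{i=1}^{n} \left[ \frac{P_i \ln P_i + Q_i \ln Q_i}{2} - \frac{P_i + Q_i}{2} \ln \frac{P_i + Q_i}{2} \right] . \tag{I.34} \]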
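
Several of the entropy-family measures are algebraically tied to one another, which matters when deciding which of them are worth evaluating separately. The Python sketch below, using the definitions as given in Cha's survey with assumed function names and strictly positive example vectors, checks that Jeffreys is the symmetrized Kullback-Leibler measure, that Topsoe is twice Jensen-Shannon, and that the Jensen difference equals Jensen-Shannon.

```python
import math


def kullback_leibler(p_vec, q_vec):
    return sum(a * math.log(a / b) for a, b in zip(p_vec, q_vec))


def jeffreys(p_vec, q_vec):
    return sum((a - b) * math.log(a / b) for a, b in zip(p_vec, q_vec))


def k_divergence(p_vec, q_vec):
    return sum(a * math.log(2.0 * a / (a + b)) for a, b in zip(p_vec, q_vec))


def topsoe(p_vec, q_vec):
    return k_divergence(p_vec, q_vec) + k_divergence(q_vec, p_vec)


def jensen_shannon(p_vec, q_vec):
    return 0.5 * (k_divergence(p_vec, q_vec) + k_divergence(q_vec, p_vec))


def jensen_difference(p_vec, q_vec):
    total = 0.0
    for a, b in zip(p_vec, q_vec):
        m = 0.5 * (a + b)
        total += 0.5 * (a * math.log(a) + b * math.log(b)) - m * math.log(m)
    return total


# Hypothetical strictly positive example vectors (logarithms require positive entries).
P = [0.2, 0.3, 0.5]
Q = [0.1, 0.6, 0.3]

print(jeffreys(P, Q), kullback_leibler(P, Q) + kullback_leibler(Q, P))
print(topsoe(P, Q), 2.0 * jensen_shannon(P, Q))
print(jensen_difference(P, Q), jensen_shannon(P, Q))
```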


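Cha [283] also presents a family of combinations, distance measures incorporating concepts and parts of multiple measures. This family includes Taneja,

\[ d_{\mathrm{TJ}} = \sum_{i=1}^{n} \frac{P_i + Q_i}{2} \ln \frac{P_i + Q_i}{2 \sqrt{P_i Q_i}}, \tag{I.35} \]

Kumar-Johnson,

\[ d_{\mathrm{KJ}} = \sum_{i=1}^{n} \frac{(P_i^2 - Q_i^2)^2}{2 (P_i Q_i)^{3/2}}, \tag{I.36} \]

and the average of $L_p$ for p = [1, ∞],

\[ d_{\mathrm{Avg}} = \frac{\sum_{i=1}^{n} |P_i - Q_i| + \max_i |P_i - Q_i|}{2} . \tag{I.37} \]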


A further group of distance measures, termed vicissitude, includes additional variations of other metrics. This family includes Vicis-Wave Hedges,

\[ d_{\mathrm{V\text{-}Wave}} = \sum_{i=1}^{n} \frac{|P_i - Q_i|}{\min(P_i, Q_i)}, \tag{I.38} \]

and three variations of Symmetric χ², which differ from the Squared L2 family in the denominator, with the denominator of $d_{\mathrm{Div}}$ replaced with either $\min(P_i, Q_i)$, $\min(P_i, Q_i)^2$, or $\max(P_i, Q_i)$ [283]. The final mentioned vicissitude metrics include max-symmetric χ²,

\[ d_{\mathrm{maxSym}\chi^2} = \max\left( \sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{P_i}, \; \sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{Q_i} \right), \tag{I.39} \]

and min-symmetric χ²,

\[ d_{\mathrm{minSym}\chi^2} = \min\left( \sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{P_i}, \; \sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{Q_i} \right) . \tag{I.40} \]


Although not listed in Cha's review, Jones and Furnas [594] also present the
following equation for the covariance metric,


1 1 

i= 1 


(1.41) 


with $\bar{P}$ and $\bar{Q}$ representing the means of P and Q, and the correlation,

\[ d_{\mathrm{Corr}} = \frac{\sum_{i=1}^{n} (P_i - \bar{P})(Q_i - \bar{Q})}{\sqrt{\sum_{i=1}^{n} (P_i - \bar{P})^2} \, \sqrt{\sum_{i=1}^{n} (Q_i - \bar{Q})^2}}, \tag{I.42} \]

distance metric [594]. Additionally, the Mahalanobis statistical distance metric was not covered in these reviews, but could be useful. The nominal Mahalanobis distance equation is

\[ d_{\mathrm{Mahal}} = \sqrt{(P_i - \bar{P})' S^{-1} (P_i - \bar{P})}, \tag{I.43} \]

where S is the data covariance matrix [598]. Mahalanobis distance can be extended to a similarity between two vectors through

\[ d_{\mathrm{Mahal}(x,y)} = \sqrt{(P_i - Q_i)' S^{-1} (P_i - Q_i)}, \tag{I.44} \]

where S is a pooled covariance matrix. For use herein, squaring (I.44) would be more practical to remove the square root for derivation simplicity.
APPENDIX J: Derivatives and Prototype Vectors Updates for Selected Distance Metrics


There  is  a  measure  in  all  things. 

-Horace,  65BC  -  8BC 

In this appendix, derivatives for the distance measures selected in Section 5.3.1 are formulated. Derivatives for relevance measures discussed in Section 5.3.4 are also considered as needed here. Per the formulation of the cost functions in LVQ algorithms, derivatives of distance measures and metrics are made with respect to the PV, w, or with respect to the relevance vector, ψ, when relevance components of LVQ algorithms are being considered.


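J.1 Cosine

If one considers that the denominator of the cosine measure in (I.13) is a scalar, then we can consider the cosine measure as

\[ d_{\cos} = \sum_{i=1}^{N_F} \frac{x_i w_i}{\sqrt{\sum_{i=1}^{N_F} x_i^2} \, \sqrt{\sum_{i=1}^{N_F} w_i^2}}, \tag{J.1} \]

where the derivative can then be computed via the quotient rule, (E.6), with $u = x_i w_i$, $v = \sqrt{\sum_{i=1}^{N_F} x_i^2} \sqrt{\sum_{i=1}^{N_F} w_i^2}$, and then for the derivative with respect to w: $du = x_i$ and $dv = \dfrac{\sqrt{\sum_{i=1}^{N_F} x_i^2}}{\sqrt{\sum_{i=1}^{N_F} w_i^2}} \, w_i$. Therefore the derivative via the quotient rule is,

\[ \frac{\partial d_{\cos}}{\partial w} = \sum_{i=1}^{N_F} \frac{x_i \sqrt{\sum_{i=1}^{N_F} x_i^2} \sqrt{\sum_{i=1}^{N_F} w_i^2} - x_i w_i \dfrac{\sqrt{\sum_{i=1}^{N_F} x_i^2}}{\sqrt{\sum_{i=1}^{N_F} w_i^2}} \, w_i}{\sum_{i=1}^{N_F} x_i^2 \, \sum_{i=1}^{N_F} w_i^2} . \tag{J.2} \]

The Cosine distance measure with relevance learning can be formulated as

\[ d_{\cos,\psi} = \sum_{i=1}^{N_F} \frac{\psi_i x_i w_i}{\sqrt{\sum_{i=1}^{N_F} x_i^2} \, \sqrt{\sum_{i=1}^{N_F} w_i^2}} . \tag{J.3} \]

Per the quotient rule, (E.6), with $u = \psi_i x_i w_i$, $v = \sqrt{\sum_{i=1}^{N_F} x_i^2} \sqrt{\sum_{i=1}^{N_F} w_i^2}$, and then for the derivative with respect to ψ: $du = x_i w_i$ and $dv = 0$, then

\[ \frac{\partial d_{\cos,\psi}}{\partial \psi} = \sum_{i=1}^{N_F} \frac{x_i w_i}{\sqrt{\sum_{i=1}^{N_F} x_i^2} \, \sqrt{\sum_{i=1}^{N_F} w_i^2}} . \tag{J.4} \]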


J.2 Sorensen and Canberra

Sorensen and Canberra are similar expressions. Considering the prototype vectors and exemplar data, Sorensen, from (I.5), is defined as

\[ d_{\mathrm{Sor}} = \frac{\sum_{i=1}^{N_F} (x_i - w_i)}{\sum_{i=1}^{N_F} (x_i + w_i)}, \tag{J.5} \]

and Canberra, from (I.6), is defined as

\[ d_{\mathrm{Can}} = \sum_{i=1}^{N_F} \frac{x_i - w_i}{x_i + w_i}, \tag{J.6} \]

with the underlying difference being that Sorensen considers a ratio of sums whereas Canberra considers a sum of ratios. However, while the distance measures produce different distances (which were uncorrelated per the discussion in), both have similar derivations with respect to $\partial/\partial w$. For both Sorensen and Canberra $u = x_i - w_i$, $v = x_i + w_i$, and then for the derivative with respect to w: $du = -1$ and $dv = 1$. Therefore the derivatives via the quotient rule are

\[ \frac{\partial d_{\mathrm{Sor}}}{\partial w} = \sum_{i=1}^{N_F} \frac{-2 x_i}{(x_i + w_i)^2} \tag{J.7} \]

and

\[ \frac{\partial d_{\mathrm{Can}}}{\partial w} = \sum_{i=1}^{N_F} \frac{-2 x_i}{(x_i + w_i)^2} . \tag{J.8} \]
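
The per-feature Canberra derivative can be sanity-checked numerically. The Python sketch below assumes the signed form of the measure (no absolute value, matching $u = x_i - w_i$ above) and positive features; the function names, example values, and the finite-difference helper are illustrative assumptions rather than part of the GRLVQI-D code.

```python
def canberra_signed(x, w):
    """Signed Canberra form: sum of (x_i - w_i) / (x_i + w_i)."""
    return sum((xi - wi) / (xi + wi) for xi, wi in zip(x, w))


def canberra_grad_w(x, w):
    """Per-feature quotient-rule derivative with u = x_i - w_i, v = x_i + w_i."""
    return [-2.0 * xi / (xi + wi) ** 2 for xi, wi in zip(x, w)]


def finite_diff(f, vec, i, h=1e-6):
    """Central finite difference of f along coordinate i (sanity check only)."""
    up, dn = list(vec), list(vec)
    up[i] += h
    dn[i] -= h
    return (f(up) - f(dn)) / (2 * h)


# Hypothetical exemplar x and prototype vector w with positive entries.
x = [1.0, 2.0, 0.5]
w = [0.8, 1.5, 0.9]

analytic = canberra_grad_w(x, w)
numeric = [finite_diff(lambda v: canberra_signed(x, v), w, i) for i in range(len(w))]
print(analytic)
print(numeric)
```

Each entry of the gradient depends only on its own feature, which is what makes the sum-of-ratios form straightforward to extend with a per-feature relevance term.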


Due to Sorensen offering consistent, albeit slightly lower, performance than Canberra in LVQ and the relative difficulty of introducing a relevance term into the Sorensen expression, only Canberra was further considered for RLVQ, GLVQ, GRLVQ, and GRLVQI. To implement relevance learning, the relevance must be added so that it multiplies each feature,

\[ d_{\mathrm{Can},\psi} = \sum_{i=1}^{N_F} \psi_i \frac{x_i - w_i}{x_i + w_i}, \tag{J.9} \]

which means $u = \psi_i (x_i - w_i)$, $v = x_i + w_i$, and then for the derivative with respect to ψ: $du = (x_i - w_i)$ and $dv = 0$. The resulting derivative is therefore,

\[ \frac{\partial d_{\mathrm{Can},\psi}}{\partial \psi} = \sum_{i=1}^{N_F} \frac{x_i - w_i}{x_i + w_i} . \tag{J.10} \]


J.3 Pseudo-Cosine

Considering the prototype vectors and exemplar data, the Pseudo-Cosine measure of (I.14) becomes

\[ d_{\mathrm{Pcos}} = \sum_{i=1}^{N_F} \frac{x_i w_i}{\sum_{i=1}^{N_F} x_i \, \sum_{i=1}^{N_F} w_i} . \tag{J.11} \]

The quotient rule, (E.6), can be used to compute the derivative, with $u = x_i w_i$, $v = \sum_{i=1}^{N_F} x_i \sum_{i=1}^{N_F} w_i$, and then for the derivative with respect to w: $du = x_i$ and $dv = \sum_{i=1}^{N_F} x_i$. Therefore the derivative via the quotient rule is,

\[ \frac{\partial d_{\mathrm{Pcos}}}{\partial w} = \sum_{i=1}^{N_F} \frac{x_i \sum_{i=1}^{N_F} x_i \sum_{i=1}^{N_F} w_i - x_i w_i \sum_{i=1}^{N_F} x_i}{\left( \sum_{i=1}^{N_F} x_i \, \sum_{i=1}^{N_F} w_i \right)^2} . \tag{J.12} \]


J.4 Pearson χ²

Considering the prototype vectors and exemplar data, the Pearson χ² measure of (I.23) becomes

\[ d_{P\chi^2} = \sum_{i=1}^{N_F} \frac{(x_i - w_i)^2}{w_i} . \tag{J.13} \]

The quotient rule, (E.6), can be used to compute the derivative, with $u = (x_i - w_i)^2$, $v = w_i$, and then for the derivative with respect to w: $du = -2(x_i - w_i)$ and $dv = 1$. Therefore the derivative via the quotient rule is,

\[ \frac{\partial d_{P\chi^2}}{\partial w} = \sum_{i=1}^{N_F} \frac{-2 w_i (x_i - w_i) - (x_i - w_i)^2}{w_i^2} . \tag{J.14} \]
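
The Pearson χ² derivative follows directly from the quotient rule with $u = (x_i - w_i)^2$ and $v = w_i$, giving $(v\,du - u\,dv)/v^2$ per feature. The Python sketch below evaluates that expression and compares it against a finite-difference estimate; the function names and example values are assumptions for illustration, and positive prototype entries are assumed so that the division is defined.

```python
def pearson_chi2(x, w):
    """Pearson chi-squared form with the prototype vector in the denominator."""
    return sum((xi - wi) ** 2 / wi for xi, wi in zip(x, w))


def pearson_chi2_grad_w(x, w):
    """Per-feature quotient rule: (v*du - u*dv) / v^2 with u=(x-w)^2, v=w."""
    grad = []
    for xi, wi in zip(x, w):
        u, v = (xi - wi) ** 2, wi
        du, dv = -2.0 * (xi - wi), 1.0
        grad.append((v * du - u * dv) / v ** 2)
    return grad


def finite_diff(f, vec, i, h=1e-6):
    """Central finite difference of f along coordinate i (sanity check only)."""
    up, dn = list(vec), list(vec)
    up[i] += h
    dn[i] -= h
    return (f(up) - f(dn)) / (2 * h)


# Hypothetical exemplar x and prototype vector w with positive entries.
x = [1.0, 2.0, 0.5]
w = [0.8, 1.5, 0.9]

analytic = pearson_chi2_grad_w(x, w)
numeric = [finite_diff(lambda v: pearson_chi2(x, v), w, i) for i in range(len(w))]
print(analytic)
print(numeric)
```

Algebraically each entry simplifies to $1 - x_i^2 / w_i^2$, so the derivative grows rapidly as a prototype entry approaches zero; this is one practical reason to be cautious with this measure on features near zero.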


J.5 Neyman χ²

Considering the prototype vectors and exemplar data, the Neyman χ² measure of (I.24) becomes

\[ d_{N\chi^2} = \sum_{i=1}^{N_F} \frac{(x_i - w_i)^2}{x_i} . \tag{J.15} \]

The quotient rule, (E.6), can be used to compute the derivative, with $u = (x_i - w_i)^2$, $v = x_i$, and then for the derivative with respect to w: $du = -2(x_i - w_i)$ and $dv = 0$. Therefore the derivative via the quotient rule is,

\[ \frac{\partial d_{N\chi^2}}{\partial w} = \sum_{i=1}^{N_F} \frac{-2 x_i (x_i - w_i)}{x_i^2} . \tag{J.16} \]


J.6 Additive Symmetry

Considering the prototype vectors and exemplar data, the Additive Symmetry χ² measure of (I.28) becomes

\[ d_{AS\chi^2} = \sum_{i=1}^{N_F} \frac{(x_i - w_i)^2 (x_i + w_i)}{x_i w_i} . \tag{J.17} \]

The quotient rule, (E.6), can be used to compute the derivative, with $u = (x_i - w_i)^2 (x_i + w_i)$, $v = x_i w_i$, and then for the derivative with respect to w: $du = 3 w_i^2 - 2 x_i w_i - x_i^2$ and $dv = x_i$. Therefore the derivative via the quotient rule is,

\[ \frac{\partial d_{AS\chi^2}}{\partial w} = \sum_{i=1}^{N_F} \frac{x_i w_i \left( 3 w_i^2 - 2 x_i w_i - x_i^2 \right) - x_i (x_i - w_i)^2 (x_i + w_i)}{(x_i w_i)^2} . \tag{J.18} \]

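J.7 Covariance

The covariance measure, (I.41), involves determining the means of both the PVs and data. In matrix notation one can express (I.41) as

\[ d_{\mathrm{cov}} = \left( x - \frac{\mathbf{1}\mathbf{1}'x}{n} \right)' \left( w - \frac{\mathbf{1}\mathbf{1}'w}{n} \right) = \left( x' - \frac{x'\mathbf{1}\mathbf{1}'}{n} \right) \left( w - \frac{\mathbf{1}\mathbf{1}'w}{n} \right); \tag{J.19} \]

multiplying the expression yields,

\[ d_{\mathrm{cov}} = x'w - \frac{x'\mathbf{1}\mathbf{1}'w}{n} - \frac{x'\mathbf{1}\mathbf{1}'w}{n} + \frac{x'\mathbf{1}\mathbf{1}'\mathbf{1}\mathbf{1}'w}{n^2} . \tag{J.20} \]

Taking the derivative of this expression yields,

\[ \frac{\partial d_{\mathrm{cov}}}{\partial w} = x' - \frac{x'\mathbf{1}\mathbf{1}'}{n} - \frac{x'\mathbf{1}\mathbf{1}'}{n} + \frac{x'\mathbf{1}\mathbf{1}'\mathbf{1}\mathbf{1}'}{n^2}, \tag{J.21} \]

which can be simplified algebraically to

\[ \frac{\partial d_{\mathrm{cov}}}{\partial w} = x' \left( I - \frac{J}{n} \right), \tag{J.22} \]

where I is an identity matrix and J is a matrix of ones.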


J.8 Squared Mahalanobis

As illustrated in Section 5.3.3.1, Mahalanobis distance and squared Mahalanobis distance are perfectly correlated. Therefore, for use herein, squaring (I.44) was assumed to be more practical to remove the square root for simplicity in derivations. The covariance matrix S is assumed to be the covariance of the data. In matrix notation, the squared form of (I.44) can be expressed as:

\[ d_{\mathrm{Mahal}(x,y)} = (x - w)' S^{-1} (x - w), \tag{J.23} \]

which can be expressed as

\[ d_{\mathrm{Mahal}(x,y)} = (x' - w') S^{-1} (x - w). \tag{J.24} \]

One can now appropriately distribute the covariance matrix,

\[ d_{\mathrm{Mahal}(x,y)} = (x' S^{-1} - w' S^{-1})(x - w), \tag{J.25} \]

which expands to

\[ d_{\mathrm{Mahal}(x,y)} = x' S^{-1} x - x' S^{-1} w - w' S^{-1} x + w' S^{-1} w, \tag{J.26} \]

which has the first derivative

\[ \frac{\partial d_{\mathrm{Mahal}(x,y)}}{\partial w} = -2 S^{-1} (x - w). \tag{J.27} \]
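
A small two-feature Python sketch can confirm the squared Mahalanobis derivative numerically. The inverse covariance matrix below is a hypothetical symmetric example supplied directly (rather than estimated from data), and the function names and finite-difference helper are assumptions for illustration.

```python
def mat_vec(a, v):
    """Multiply a small dense matrix (list of rows) by a vector."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def sq_mahalanobis(x, w, s_inv):
    """Squared Mahalanobis distance (x - w)' S^{-1} (x - w)."""
    diff = [xi - wi for xi, wi in zip(x, w)]
    return sum(d * t for d, t in zip(diff, mat_vec(s_inv, diff)))


def sq_mahalanobis_grad_w(x, w, s_inv):
    """Gradient with respect to w for symmetric S^{-1}: -2 S^{-1} (x - w)."""
    diff = [xi - wi for xi, wi in zip(x, w)]
    return [-2.0 * t for t in mat_vec(s_inv, diff)]


def finite_diff(f, vec, i, h=1e-6):
    """Central finite difference of f along coordinate i (sanity check only)."""
    up, dn = list(vec), list(vec)
    up[i] += h
    dn[i] -= h
    return (f(up) - f(dn)) / (2 * h)


# Hypothetical symmetric inverse covariance, exemplar x, and prototype vector w.
S_INV = [[2.0, 0.5],
         [0.5, 1.0]]
x = [1.0, 2.0]
w = [0.5, 1.5]

analytic = sq_mahalanobis_grad_w(x, w, S_INV)
numeric = [finite_diff(lambda v: sq_mahalanobis(x, v, S_INV), w, i) for i in range(len(w))]
print(sq_mahalanobis(x, w, S_INV))
print(analytic)
print(numeric)
```

The compact form of the gradient relies on the inverse covariance matrix being symmetric, which holds for any covariance matrix; for a non-symmetric weighting matrix the two cross terms would not combine.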


J.9 Harmonic Mean

When related to example data and PVs, the Harmonic Mean measure in (I.12) becomes

\[ d_{\mathrm{HM}} = \sum_{i=1}^{N_F} \frac{x_i w_i}{x_i + w_i}, \tag{J.28} \]

on which one can use the quotient rule in (E.6) to compute the derivative with $u = x_i w_i$, $v = x_i + w_i$, and then for the derivative with respect to w: $du = x_i$ and $dv = 1$. Therefore the derivative via the quotient rule is,

\[ \frac{\partial d_{\mathrm{HM}}}{\partial w} = \sum_{i=1}^{N_F} \frac{x_i (x_i + w_i) - x_i w_i}{(x_i + w_i)^2} . \tag{J.29} \]
APPENDIX K: Design of Experiments Results


Count what is countable, measure what is measurable, and what is not measurable, make measurable.

-Galileo Galilei, 1564 - 1642

Table K-1 presents design of experiments results for the Cosine GRLVQI, Canberra GRLVQI, and Squared Euclidean GRLVQI (baseline) when considering all design points from Table V-6 for Z-Wave data. In Table K-1, factor levels correspond to those listed in Table V-6 with the notation of “-” for a low setting, “+” for a high setting, and “0” for the middle setting.


Table K-1: Design of Experiments Results


Factor 

Algorithm 

Cosine 

Canberra 

Squared  Euclidean 

A 

B 

C 

D 

E 

Train 

Test 

Mean 

Train 

Test 

Mean 

Train 

Test 

AUCC 

AUCC 

Auth. 

AUCC 

AUCC 

Auth. 

AUCC 

AUCC 

AUC 

AUC 

■■■ 

- 

- 

- 

13.22029 

13.2029 

0.974386 

8.788406 

7.846377 

0.476263 

14.68116 

14.84203 

0.736326 

- 

- 

+ 

13.20725 

13.22174 

0.96775 

8.773913 

8.068116 

0.572325 

14.79565 

14.74638 

0.713711 

- 

- 

- 

13.49275 

12.99565 

0.987486 

8.763768 

8.001449 

0.580113 

14.65797 

14.87681 

0.740485 

- 

- 

+ 

13.35797 

13.2913 

0.968299 

8.8 

8.042029 

0.53436 

14.68986 

14.77681 

0.690756 

- 

- 

- 

13.23623 

13.28986 

0.972098 

8.795652 

7.844928 

0.546301 

14.61884 

14.94058 

0.695009 

- 

- 

+ 

13.28116 

13.19565 

0.966144 

8.557971 

7.981159 

0.553403 

14.64783 

14.76957 

0.688217 

- 

- 

- 

13.42029 

13.36667 

0.96017 

8.775362 

8.078261 

0.513428 

14.77101 

14.78406 

0.686377 

- 

- 

+ 

13.31884 

13.22174 

0.970473 

8.724638 

8.004348 

0.566093 

14.64348 

14.75072 

0.693384 

- 

- 

- 

13.29565 

13.41884 

0.975728 

8.763768 

8.06087 

0.579855 

14.63333 

14.85942 

0.693422 

- 

- 

+ 

13.33188 

13.12319 

0.934934 

8.844928 

8.1 

0.628444 

14.23188 

14.47681 

0.658381 

- 

- 

- 

13.3029 

13.33043 

0.990454 

8.569565 

8.002899 

0.4754 

14.23623 

14.6058 

0.855892 

- 

- 

+ 

13.22174 

13.17391 

0.959168 

8.708696 

8.197101 

0.514171 

14.27536 

14.49855 

0.788733 

- 

- 

- 

13.23913 

13.15797 

0.954915 

8.673913 

8.163768 

0.543371 

14.30145 

14.51304 

0.838362 

- 

- 

+ 

13.44638 

13.57391 

0.948204 

8.913043 

8.308696 

0.518922 

14.22899 

14.21884 

0.839609 


13.03768 

12.91884 

0.973837 

8.807246 

8.088406 

0.595167 

14.5 

13.9029 

13.11159 

12.97971 

0.972136 

8.721739 

7.642029 

0.521222 

14.62174 

14.38551 

12.94783 

13.21304 

0.954726 

8.686957 

8.176812 

0.516982 

14.56812 

14.1087 

13.17971 

13.01884 

0.983913 

8.747826 

8.007246 

0.458147 

14.61304 

14.25072 

12.83623 

12.9971 

0.98104 

8.749275 

8.014493 

0.58402 

14.54493 

14.31014 

13.10725 

13.07391 

0.921626 

8.746377 

8.046377 

0.45804 

14.55507 

14.31449 

12.87536 

13.04493 

0.897826 

8.728986 

8.244928 

0.475079 

14.23768 

14.12899 

12.98551 

13.18116 

0.890095 

8.869565 

8.165217 

0.557001 

14.23623 

14.01594 

13.04783 

13.19855 

0.984839 

8.730435 

7.905797 

0.493239 

14.35217 

13.63768 

13.02174 

13.26667 

0.985992 

8.723188 

8.133333 

0.578544 

14.2087 

13.85652 

12.99855 

13.05362 

0.985142 

9.023188 

7.963768 

0.438185 

14.28696 

13.96812 

12.9913 

13.25072 

0.965369 

8.733333 

8.126087 

0.499408 

14.33333 

13.91594 

13.0913 

13.07971 

0.98673 

8.692754 

7.798551 

0.454726 

13.96377 

13.81884 

12.99855 

13.27391 

0.850473 

8.72029 

8.124638 

0.544921 

14.15217 

13.84058 

12.98986 

13.01884 

0.678544 

8.853623 

7.988406 

0.487026 

14.3058 

13.90145 

12.98261 

12.80145 

0.972042 

8.615942 

8.110145 

0.67828 

14.38696 

13.84493 

12.77101 

13.04493 

0.966616 

8.588406 

8.178261 

0.511424 

14.37101 

14.04638 

13.10435 

12.87246 

0.951096 

8.824638 

7.92029 

0.71988 

14.35362 

13.76232 

12.95362 

12.98116 

0.951664 

8.765217 

7.989855 

0.498009 

14.3971 

14.08406 

12.93043 

13.0942 

0.97155 

8.608696 

7.9 

0.533056 

14.36522 

14.13043 

13.12319 

12.93188 

0.898431 

8.757971 

8.047826 

0.508513 

14.43333 

14.00145 

12.93768 

13.30435 

0.892363 

8.689855 

7.837681 

0.529817 

14.26087 

13.86667 

13.06522 

12.78261 

0.947732 

8.766667 

7.985507 

0.543314 

14.42899 

13.99275 

12.8942 

12.97246 

0.957372 

8.694203 

8.028986 

0.470082 

14.46957 

14.22319 

15.05217 

14.99275 

0.989943 

8.724638 

7.788406 

0.525829 

14.91884 

15.01304 

15.24348 

15.3087 

0.993403 

8.82029 

8.068116 

0.511487 

14.87391 

14.63623 

15.26667 

15.12319 

0.994896 

8.818841 

8.107246 

0.449112 

14.67391 

14.38696 

15.13623 

14.90145 

0.993535 

8.795652 

8.036232 

0.78564 

15.21884 

14.88261 

15.2913 

15.01159 

0.989603 

8.714493 

8.02029 

0.570744 

15.35362 

15.3087 

15.21014 

15.05652 

0.995595 

8.649275 

8.014493 

0.793377 

14.97101 

14.67246 

15.02754 

14.88986 

0.994915 

8.723188 

8.024638 

0.677013 

15.17971 

15.07391 

15.1913 

15.12899 

0.995482 

8.630435 

7.72029 

0.722111 

15.23623 

15.11594 

15.12609 

15.44203 

0.995652 

8.844928 

7.982609 

0.438129 

15.29275 

15.31449 

15.1058 

15.22899 

0.993667 

8.86087 

7.913043 

0.669666 

14.36087 

14.2 

15.20145 

15.26522 

0.994915 

8.715942 

8.075362 

0.538834 

14.4058 

14.17536 

0.648815 

0.621109 

0.677032 

0.641078 

0.643869 

0.62414 

0.830662 

0.789137 

0.835854 

0.767832 

0.760038 

0.783793 

0.75644 

0.739382 

0.740964 

0.746553 

0.769962 

0.802684 

0.680725 

0.751474 

0.768425 

0.727839 

0.731235 

0.651159 

0.9177 

0.91264 

0.905325 

0.955734 

0.946049 

0.933951 

0.947467 

0.953214 

0.933106 

0.905608 

0.923113 
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HOD 


a 

ID 

m 


15.2087 


15.20435 


15.24493 


15.23623 


15.24783 


15.09855 


15.25797 


15.17826 


15.23478 


15.41304 


15.25217 


15.20435 


15.41014 


15.2942 


15.21014 


15.51884 


15.49855 


15.67826 


15.55652 


15.53623 


15.47391 


15.41159 


15.46957 


15.59275 


15.51159 


15.67246 


15.42029 


15.46812 


15.47826 


15.61159 


15.62319 


15.64638 


15.68116 


15.62029 


15.0971 


15.19855 


15.15797 


15.23188 


15.22174 


15.31884 


15.01014 


15.0971 


15.12609 


14.98696 


15.20725 


15.18116 


15.14058 


15.12174 


15.24638 


15.27246 


15.51014 


15.54638 


15.55652 


15.79855 


15.6029 


15.45797 


15.4942 


15.62319 


15.31014 


15.58406 


15.55072 


15.55652 


15.47826 


15.32174 


15.56812 


15.47536 


15.44058 


15.49275 


15.23478 


0.995652 

8.647826 

7.942029 

0.707763 

14.17536 

13.76522 

0.995652 

8.801449 

8.108696 

0.698355 

14.96232 

14.75217 

0.994858 

8.882609 

8.044928 

0.597379 

14.85507 

14.5971 

0.995388 

8.798551 

8.026087 

0.720252 

14.56087 

14.33188 

0.995652 

8.889855 

7.965217 

0.535696 

15.07536 

14.38841 

0.994159 

8.884058 

7.844928 

0.796497 

14.86522 

14.77826 

0.99552 

9.06087 

8.047826 

0.534127 

14.70435 

14.45072 

0.994972 

8.833333 

7.823188 

0.455041 

14.95652 

14.72754 

0.994405 

8.881159 

7.965217 

0.546541 

14.56377 

14.31884 

0.995369 

8.672464 

7.943478 

0.613611 

14.56667 

14.16522 

0.995085 

8.775362 

8.101449 

0.486761 

15.33188 

15.21884 

0.994026 

8.675362 

8.131884 

0.686957 

15.36667 

15.28696 

0.992552 

8.911594 

7.910145 

0.687618 

14.93478 

14.68261 

0.995028 

8.75942 

8.114493 

0.523428 

15.33913 

15.10435 

0.994442 

8.689855 

8.121739 

0.539351 

15.31159 

15.39275 

0.995406 

8.876812 

7.986957 

0.68138 

15.29565 

14.77536 

0.994216 

8.878261 

7.981159 

0.611853 

15.23043 

14.75217 

0.992533 

8.865217 

7.975362 

0.5177 

15.34058 

14.75217 

0.994253 

8.465217 

7.913043 

0.615482 

15.26812 

14.3942 

0.993081 

8.531884 

7.884058 

0.796043 

15.85362 

15.21739 

0.993403 

8.662319 

8.055072 

0.513711 

15.58696 

14.82899 

0.993913 

8.721739 

8.005797 

0.585167 

15.58261 

14.69565 

0.993875 

8.775362 

7.866667 

0.919086 

15.65072 

15.31594 

0.984008 

8.850725 

7.965217 

0.549572 

15.66957 

15.33623 

0.994631 

8.821739 

7.95942 

0.723686 

15.5058 

15.02029 

0.993176 

8.821739 

8.049275 

0.610315 

15.72609 

15.09855 

0.994178 

8.791304 

7.843478 

0.765142 

15.69275 

14.69855 

0.990548 

8.75942 

8.104348 

0.777372 

15.38116 

14.69855 

0.99482 

8.682609 

8.001449 

0.626068 

15.66232 

14.87971 

0.993648 

8.765217 

7.911594 

0.842054 

15.61739 

14.96957 

0.994348 

8.813043 

8.073913 

0.561103 

15.58986 

14.68261 

0.993648 

8.763768 

8.273913 

0.615406 

15.67826 

15.39565 

0.993667 

8.627536 

8.098551 

0.576522 

15.86377 

15.52899 

0.99327 

8.723188 

7.984058 

0.757328 

15.65072 

15.16377 

0.99552 

8.736232 

8.291304 

0.600989 

15.45652 

14.84783 

0.862029 


0.911229 


0.93765 


0.891487 


0.514852 


0.95528 


0.944644 


0.912483 


0.852848 


0.891771 


0.51443 


0.520586 


0.923409 


0.945904 


0.955986 


0.912703 


0.522741 


0.517297 


0.968053 


0.974348 


0.517133 


0.974008 


0.974858 


0.517694 


0.976736 


0.961582 


0.607057 


0.639326 


0.964354 


0.592035 


0.596673 


0.937883 


0.594127 


0.9715 


0.852489 
+

+ 

a 

+ 

- 

15.39855 

15.61884 

0.993554 

8.830435 

8.118841 

0.767883 

15.37681 

15.04203 

0.75903 

+ 

+ 

+ 

+ 

15.65217 

15.1942 

0.993006 

8.836232 

8.310145 

0.718973 

15.35942 

14.90435 

0.969975 

+ 

+ 

+ 

- 

- 

15.47971 

15.16232 

0.978431 

8.737681 

8.127536 

0.77804 

16.01159 

15.42319 

0.970838 

+ 

+ 

+ 

- 

+ 

15.51449 

15.41449 

0.986522 

8.814493 

8.256522 

0.826307 

15.7087 

15.44928 

0.969036 

+ 

+ 

+ 

D 

- 

15.54058 

15.40725 

0.989773 

8.727536 

7.963768 

0.790945 

15.74928 

15.02899 

0.949389 

+ 

+ 

+ 

+ 

15.48261 

15.5087 

0.992779 

8.791304 

7.87971 

0.559748 

15.98986 

15.64493 

0.958551 

+ 

+ 

+ 

+ 

- 

15.57391 

15.44783 

0.981399 

8.589855 

8.228986 

0.708135 

16.06667 

15.71739 

0.644052 

+ 

+ 

+ 

+ 

+ 

15.56087 

15.46087 

0.994499 

8.927536 

8.104348 

0.877587 

15.74928 

15.37826 

0.975009 
APPENDIX L: MEX File Programming Considerations

A  computer  is  like  an  Old  Testament  god,  with  a  lot  of  rules  and  no  mercy. 

-Joseph  Campbell,  1904-  1987 

For efficiency, MATLAB mex files are used for the GRLVQI implementation on RF-DNA data. Writing mex files involves understanding both MATLAB and C programming. Common programming issues encountered with mex files included: 1) improper distinctions between pointers and variables in the mex file, and 2) complexities and differences in mathematical programming that exist between MATLAB and C.

Additionally, compiling mex files appropriately is nontrivial. While the syntax below will compile a mex file, not all mex files performed equally fast; hence the computational speed of a mex file appears to be connected to the computer and software on which it was compiled. Per communication with Reising [599], for debugging and coding considerations one should compile a given mex file via the following command:

mex -g -v COMPFLAGS="$COMPFLAGS -Wall" -largeArrayDims FILENAME.c     (H.1)

where compiling with the “-g” option enables debugging in Microsoft Visual Studio [600].

For debugging a given mex file one should consider the following general process:

1. Start MATLAB

2. Compile

3. Start Microsoft Visual Studio

4. Open the associated c-file in Microsoft Visual Studio

5. Attach Microsoft Visual Studio to the MATLAB process

6. Insert break points as needed in the c-file (within Visual Studio)

7. Run the MATLAB algorithm under analysis.

When these steps are followed, one will find that MATLAB and Visual Studio enable rough debugging of mex files.
APPENDIX M: GRLVQI-D Performance on ZigBee RF-DNA Fingerprints with Z-Wave Based Optimization

Beware  that  thou  be  not  deceived  into  folly,  and  be  humbled. 

-Sirach  13:10  (DRA) 

ZigBee data was also considered using the optimized Squared Euclidean
GRLVQI and the optimized Cosine GRLVQI-D algorithms. However, it should be noted
that the settings were optimized only for Z-Wave RF-DNA fingerprints, and thus there
is no guarantee that they apply to ZigBee. Future research item number 2, in Section
7.3, regards using the Air Force Research Laboratory DOD Supercomputing Resource
Center (DSRC); this item is directly connected to the results in this appendix. Due to
the computational times associated with the larger ZigBee dataset (when compared to
the Z-Wave dataset), the optimization process was not repeated for ZigBee devices.
Additionally, since the Canberra GRLVQI-D results generally underperformed both the
Squared Euclidean GRLVQI and the Cosine GRLVQI-D, Canberra GRLVQI-D was
not further considered for ZigBee RF-DNA fingerprints.
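
For reference, the three distance measures named above can be sketched in Python
as follows. These are the standard textbook definitions between a feature vector x and
a prototype vector w; the function names are illustrative, and the relevance weighting
applied inside GRLVQI and GRLVQI-D is deliberately omitted, so this is a minimal
sketch under those assumptions rather than the implementation used in this work.

```python
import math


def squared_euclidean(x, w):
    """Sum of squared component differences."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def cosine_distance(x, w):
    """One minus the cosine of the angle between x and w."""
    dot = sum(xi * wi for xi, wi in zip(x, w))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(wi * wi for wi in w))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def canberra(x, w):
    """Sum of |xi - wi| / (|xi| + |wi|); 0/0 terms are treated as zero."""
    total = 0.0
    for xi, wi in zip(x, w):
        denom = abs(xi) + abs(wi)
        if denom > 0.0:
            total += abs(xi - wi) / denom
    return total
```

Note that the cosine distance ignores vector magnitude while the Canberra distance
normalizes each component separately, which is one reason the three measures can
rank the same prototypes differently.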

Figure M-1 presents training (TNG) and testing (TST) classification results from
the baseline Squared Euclidean GRLVQI algorithm, the Squared Euclidean GRLVQI
algorithm using the Classification-optimized settings in Table V-9, and the Squared
Euclidean GRLVQI algorithm using the Verification-optimized settings in Table V-9.
Noticeably, classification performance of the optimized algorithms appears slightly lower
than the baseline ZigBee GRLVQI performance. The Classification-based optimized
Squared Euclidean GRLVQI shows a gain of -4.4 dB (TNG) and -2.69 dB (TST) at 90%
accuracy; the Verification-based optimized Squared Euclidean GRLVQI shows a gain of
-13.44 dB (TNG) and -10.48 dB (TST). Both sets of gains are negative, i.e., losses
relative to the baseline.


Figure M-1: ZigBee GRLVQI Classification Performance Using Squared Euclidean
Distance Using Optimized Algorithmic Settings.
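
The gain at 90% accuracy quoted above can be illustrated with a short Python sketch.
It assumes that gain is the difference between the SNR at which the baseline curve
first reaches 90% and the SNR at which the candidate curve does, so that a negative
value means the candidate needs a higher SNR than the baseline. That sign convention,
the linear interpolation, and the function names are assumptions made for illustration;
they are not taken from the code used in this work.

```python
def snr_at_accuracy(snr_db, accuracy, target=0.90):
    """Lowest SNR (dB) at which the accuracy curve first reaches `target`,
    found by linear interpolation. Returns None if it is never reached."""
    points = sorted(zip(snr_db, accuracy))
    if points[0][1] >= target:
        return points[0][0]
    for (s0, a0), (s1, a1) in zip(points, points[1:]):
        if a0 < target <= a1:
            return s0 + (target - a0) * (s1 - s0) / (a1 - a0)
    return None


def gain_db(baseline_curve, candidate_curve, target=0.90):
    """Baseline SNR minus candidate SNR at the target accuracy.
    Each curve is a (snr_db, accuracy) pair of equal-length lists.
    Returns None (reported as N/A) if either curve never reaches the target."""
    base = snr_at_accuracy(baseline_curve[0], baseline_curve[1], target)
    cand = snr_at_accuracy(candidate_curve[0], candidate_curve[1], target)
    if base is None or cand is None:
        return None
    return base - cand
```

A curve that never reaches 90% accuracy yields no gain value, which corresponds to
an N/A entry in a results table.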

Figure M-2 presents both the authorized, Figure M-2a, and rogue rejected, Figure
M-2b, verification performance for the Classification-optimized Squared Euclidean
GRLVQI algorithm. When compared with baseline performance, presented in Table
V-5, the Classification-optimized Squared Euclidean GRLVQI performance has
improved authorized verification performance (50% versus 25%), but reduced rogue
rejection verification performance (30.56% versus 52.78%).
Figure M-2: GRLVQI ID Verification Performance of ZigBee in Squared
Euclidean GRLVQI using Z-Wave Determined Classification-Based Optimization
Settings at 18 dB.


Figure M-3 similarly presents both the authorized, Figure M-3a, and rogue
rejected, Figure M-3b, verification performance for the Verification-optimized Squared
Euclidean GRLVQI algorithm. Noticeably, performance is degraded compared to the
Classification-optimized algorithmic results in Figure M-2. When compared with
baseline performance, presented in Table V-5, the Verification-optimized Squared
Euclidean GRLVQI performance has worse authorized verification performance (0%
versus 25%), and worse rogue rejection verification performance (41.66% versus
52.78%).
Figure M-3: GRLVQI ID Verification Performance of ZigBee in Squared
Euclidean GRLVQI using Z-Wave Determined Verification-Based Optimization
Settings at 18 dB.


Figure M-4 presents training (TNG) and testing (TST) classification results from
the Cosine GRLVQI-D algorithm in comparison with the baseline Squared Euclidean
GRLVQI algorithm. Both the Cosine GRLVQI-D algorithm using the
Classification-optimized settings in Table V-9 and the Cosine GRLVQI-D algorithm
using the Verification-optimized settings in Table V-9 are presented. Noticeably,
classification performance of the optimized algorithms appears slightly worse than the
baseline ZigBee GRLVQI performance, and performance never reaches 90% accuracy.
Figure M-4: GRLVQI Classification Performance Using Cosine Distance Using
Optimized Algorithmic Settings.


Figure  M-5  presents  both  the  authorized,  Figure  M-5a,  and  rogue  rejected,  Figure 
M-5b,  verification  performance  for  the  Classification-optimized  Cosine  GRLVQI-D 
algorithm.  When  compared  with  baseline  Squared  Euclidean  GRLVQI  performance, 
presented  in  Table  V-5,  the  Classification-optimized  Cosine  GRLVQI-D  performance  has 
comparable  authorized  verification  performance  (25%  versus  25%),  but  reduced  rogue 
rejection verification performance (47.22% versus 52.78%).


[Figure M-5 plots: a) Authorized, with axis label False Verification Rate (FVR);
b) Rogue, with axis label Rogue Accept Rate (RAR); axis range 0 to 1.]


Figure  M-5:  GRLVQI  ID  Verification  Performance  of  ZigBee  in  Cosine  GRLVQI 
using  Z-Wave  Determined  Classification-Based  Optimization  Settings  at  18dB. 


Figure M-6 similarly presents both the authorized, Figure M-6a, and rogue
rejected, Figure M-6b, verification performance for the Verification-optimized Cosine
GRLVQI-D algorithm. Noticeably, performance is slightly degraded compared to the
Classification-optimized algorithmic results in Figure M-5, which is consistent with the
observations about Squared Euclidean GRLVQI in Figure M-2 and Figure M-3. When
compared with baseline performance, presented in Table V-5, the Verification-
optimized Cosine GRLVQI-D performance has worse authorized verification
performance (0% versus 25%), and worse rogue rejection verification performance
(33.33% versus 52.78%).
Figure M-6: GRLVQI ID Verification Performance of ZigBee in Cosine GRLVQI
using Z-Wave Determined Verification-Based Optimization Settings at 18 dB.


Table M-1 presents an overall comparison of classification and verification
performance for the Squared Euclidean GRLVQI algorithm and the Cosine GRLVQI-D
algorithm. Baseline performance from Table V-5 is also included for comparison.
Overall, the best performance is seen in the non-optimized Squared Euclidean GRLVQI
algorithms. This differs from the result seen in Section 5.4.3 when the Z-Wave dataset
was considered.
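
The Mean AUC figures reported for verification can be read as areas under
ROC-style curves averaged over devices. A minimal Python sketch of that calculation
follows, assuming each curve is supplied as paired lists of rates on the horizontal and
vertical axes and that the area is taken with the trapezoidal rule. The function names
and the choice of integration rule are assumptions for illustration; the exact procedure
used to produce the reported values is not shown in this appendix.

```python
def roc_auc(false_rate, true_rate):
    """Trapezoidal area under a curve given as paired (x, y) rate lists."""
    points = sorted(zip(false_rate, true_rate))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def mean_auc(curves):
    """Average AUC over a non-empty list of (false_rate, true_rate) curves."""
    return sum(roc_auc(f, t) for f, t in curves) / len(curves)
```

Under this reading, a chance-level diagonal curve gives an area of 0.5 and a perfect
curve gives 1.0, so values near 0.9 indicate strong but imperfect separation.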
Table M-1: GRLVQI Performance for ZigBee RF-DNA Data Using Z-Wave
Optimized Algorithmic Settings.

                                                              Classification                Verification (18 dB)
                                                              RAP    RAP    SNR Gain (dB)   % Aut. / Rog.   Mean AUC
Algorithm                 Optimization Method                 (TNG)  (TST)  TNG     TST     Aut.   Rog.     Aut.   Rog.

Squared Euclidean GRLVQI  None - Baseline Settings (NPV = 10) 0.99   1.00   -0.53   0.00    25%    63.9%    0.92   0.93
                          None - Baseline Settings (NPV = 13) 1.00   1.01   -0.11   +0.5    25%    52.8%    0.93   0.94
                          Classification-Based Optimization   0.91   0.93   -4.93   -2.7    50%    30.6%    0.91   0.87
                          Verification-Based Optimization     0.97   0.99   -13.9   -10.5   0%     41.7%    0.88   0.90

Cosine GRLVQI-D           Classification-Based Optimization   0.78   0.82   N/A     N/A     25%    47.2%    0.85   0.85
                          Verification-Based Optimization     0.87   0.90   N/A     N/A     0%     33.3%    0.80   0.81

Note: SNR Gain (dB) is measured at 90%C relative to the baseline TST result (NPV = 10).
% Aut. / Rog. is the % Authorized or % Rogue Rejected.